A method and system for evaluating the terrain passability of unmanned ground vehicles in unstructured environments

By using multi-sensor information fusion and neural network inference, elevation maps and accessibility maps are generated, which solves the problem of insufficient adaptability in terrain accessibility assessment in unstructured environments and achieves higher accuracy and robustness.

CN120673207B · Active · Publication Date: 2026-06-30 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING INST OF TECH
Filing Date
2025-06-10
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing terrain accessibility assessment methods adapt poorly to unstructured environments, making it difficult to accurately assess vehicle accessibility in complex, dynamic, and diverse terrains.

Method used

The method combines multi-sensor information fusion with neural network inference: data is collected through cameras and LiDAR, features are extracted using pre-trained models, and feature fusion and spatial transformation are performed to generate elevation maps and accessibility maps. Self-attention mechanisms and U-shaped networks are introduced for multi-task learning.

Benefits of technology

It improves the accuracy and robustness of accessibility judgment in complex environments, and enhances the system's reliability and cross-scenario adaptability when the perceived data is noisy.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method and system for terrain accessibility assessment of unmanned ground vehicles in unstructured environments, belonging to the field of terrain accessibility analysis technology. The method includes: acquiring image data and LiDAR point cloud data using data acquisition equipment, and extracting features from the image data and LiDAR point cloud data using a pre-trained model to obtain image feature maps and point cloud feature maps, where the data acquisition equipment includes a camera and a LiDAR; fusing the image feature maps and point cloud feature maps to obtain multimodal feature maps; spatially transforming the multimodal feature maps to obtain transformed multimodal feature maps; and classifying and calculating the transformed multimodal feature maps to obtain an elevation map and an accessibility map, using the accessibility map as the final output of the terrain accessibility assessment. This invention is applicable to unstructured scenarios with complex terrain types and frequent dynamic environmental changes.

Description

Technical Field

[0001] This invention belongs to the field of terrain accessibility analysis technology, specifically relating to a terrain accessibility assessment method and system suitable for unmanned ground vehicles in unstructured environments.

Background Technology

[0002] Performing tasks in unstructured environments, such as disaster relief, agricultural and forestry operations, mining transportation, and military reconnaissance, has received increasing attention in recent years. These tasks operate in complex and varied environments, often lacking clear road markings or structures, and featuring diverse terrain conditions including gravel, slopes, woodlands, and wetlands. In such scenarios, one of the key elements for achieving autonomous vehicle navigation and path planning is the accurate assessment of terrain accessibility. Accurately identifying areas where vehicles can safely travel is a prerequisite for ensuring the smooth and safe execution of missions. Therefore, researching terrain accessibility assessment technologies suitable for unstructured environments is of great significance for improving the environmental adaptability, mission efficiency, and operational safety of autonomous vehicles.

[0003] Current research on terrain accessibility assessment primarily relies on modeling using environmental perception data acquired by vehicle-mounted sensors (such as LiDAR, RGB-D cameras, and inertial measurement units), followed by geometric or semantic analysis of the modeling results. Environmental models represented in the form of digital elevation maps, point clouds, or semantically segmented images are widely used in the industry. Related methods typically extract terrain roughness and slope indicators from a geometric perspective, or identify different surface types such as ground, grass, rocks, and water bodies from a semantic perspective, thereby assessing the difficulty of passage for operational vehicles in a given area. However, the effectiveness of these methods largely depends on the selected assessment indicators and corresponding threshold settings, and their adaptability to the complex, dynamic, and diverse terrains in real-world environments is relatively limited.

[0004] Therefore, there is an urgent need for a terrain accessibility assessment method that can more comprehensively integrate multi-source information and adapt to environmental changes, so as to improve the accessibility judgment ability of autonomous vehicles in complex mission scenarios.

Summary of the Invention

[0005] This invention addresses the technical problem of how an unmanned ground vehicle driving in an unstructured environment can assess the cost and risk of traversing the surrounding terrain and ultimately generate an accessibility map by fusing multi-sensor information with neural network inference. It proposes a terrain accessibility assessment method and system suitable for unmanned ground vehicles in unstructured environments.

[0006] To achieve the above objectives, the present invention provides the following solution: a terrain accessibility assessment method for unmanned ground vehicles in unstructured environments, comprising the following steps:

[0007] S1. Based on the data acquisition device, image data and lidar point cloud data are acquired, and a pre-trained model is used to extract features from the image data and the lidar point cloud data to obtain image feature maps and point cloud feature maps; the data acquisition device includes: a camera and a lidar.

[0008] S2. Perform feature fusion on the image feature map and the point cloud feature map to obtain a multimodal feature map;

[0009] S3. Perform spatial transformation on the multimodal feature map to obtain the transformed multimodal feature map;

[0010] S4. Classify and calculate the transformed multimodal feature map to obtain an elevation map and an accessibility map. Use the accessibility map as the output result of the final terrain accessibility assessment.

[0011] More preferably, S2 includes the following steps:

[0012] S21. Perform a convolution operation on the image feature map and the point cloud feature map to obtain a first three-dimensional feature tensor and a second three-dimensional feature tensor.

[0013] S22. The first three-dimensional feature tensor and the second three-dimensional feature tensor are concatenated to obtain a fused feature map;

[0014] S23. Perform deep processing on the fused feature map to obtain the multimodal feature map.

[0015] More preferably, the method for performing deep processing on the fused feature map to obtain the multimodal feature map includes:

[0016] $Q = F_{fused} W_Q, \quad K = F_{fused} W_K, \quad V = F_{fused} W_V,$

[0017] $F_{SA} = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V,$

[0018] In the formula, $Q$, $K$, and $V$ denote the sets of query, key, and value vectors obtained after flattening the fused feature map $F_{fused}$; $W_Q$, $W_K$, and $W_V$ are learnable linear projection matrices; $d$ represents the attention dimension; $H$ and $W$ represent the height and width of the fused feature map $F_{fused}$; $F_{SA}$ represents the multimodal feature map.

[0019] More preferably, S3 includes the following steps:

[0020] S31. Construct a voxel mesh in a three-dimensional world coordinate system, and based on the camera's intrinsic parameter matrix and the lidar's extrinsic parameter matrix, project the three-dimensional center coordinates of each voxel onto the image plane to obtain pixel coordinates.

[0021] S32. Sample the multimodal feature map based on the pixel coordinates to obtain voxel features;

[0022] S33. Perform dimensionality reduction processing on the voxel features to obtain the transformed multimodal feature map.

[0023] More preferably, the method for obtaining the elevation map and the accessibility map in S4 includes: inputting the transformed multimodal feature map into a U-shaped network for processing to obtain a U-shaped feature map; and inputting the U-shaped feature map into a dual-branch network to obtain the elevation map and the accessibility map.

[0024] More preferably, the U-shaped network includes a downsampling layer, a residual enhancement module, and an upsampling layer;

[0025] The downsampling layer reduces the spatial resolution of the transformed multimodal feature map through continuous convolution and pooling operations;

[0026] The residual enhancement module uses stacked convolutions and shortcut (skip) connections;

[0027] The upsampling layer restores the spatial resolution of the transformed multimodal feature map through bilinear interpolation or deconvolution operations to obtain the U-shaped feature map.

[0028] More preferably, the dual-branch network includes a first independent branch and a second independent branch; the first independent branch is used to generate the elevation map and the elevation confidence map; the elevation confidence map is used to measure the confidence level of the prediction results in a specific area;

[0029] The second independent branch is used to generate the accessibility map and the accessibility confidence map; the accessibility confidence map is used to help improve the reliability of decision-making.

[0030] The present invention also provides a terrain accessibility assessment system for unmanned ground vehicles in unstructured environments, comprising: a data processing system, a fusion processing system, a spatial transformation system, and an assessment system;

[0031] The data processing system is used to extract features from the image data and lidar point cloud data acquired by the data acquisition device using a pre-trained model, thereby obtaining image feature maps and point cloud feature maps; the data acquisition device includes: a camera and a lidar.

[0032] The fusion processing system is used to fuse the image feature map and the point cloud feature map to obtain a multimodal feature map;

[0033] The spatial transformation system is used to perform spatial transformation on the multimodal feature map to obtain the transformed multimodal feature map;

[0034] The assessment system is used to classify and calculate the transformed multimodal feature map to obtain an elevation map and an accessibility map, and the accessibility map is used as the output result of the final terrain accessibility assessment.

[0035] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0036] This invention proposes a multimodal map construction framework that integrates visual and lidar point clouds. Based on the input image and point cloud, it uses neural networks for feature extraction and fusion, and simultaneously predicts and generates elevation maps and accessibility maps around the vehicle. It is suitable for unstructured scenarios with complex terrain, uncertain obstacle distribution, and frequent dynamic changes in the environment.

[0037] This invention employs a multi-task learning method that incorporates uncertainty modeling, simultaneously training two sub-tasks: terrain accessibility assessment and confidence estimation. This allows the model to adaptively allocate the importance of different tasks while sharing feature representations, which helps improve the model's adaptability to blurred, occluded, or interfered areas and enhances the reliability of accessibility analysis in complex environments with high perceived data noise.

[0038] This invention introduces a large-scale visual pre-trained model as an image feature extractor. By utilizing its strong representational ability obtained through training in diverse semantic scenarios, the system can still stably extract discriminative deep features in unfamiliar or unseen environments, thereby improving the stability and accuracy of the passability assessment method in cross-scenario applications and enhancing the generalization ability of overall perception.

[0039] This invention introduces a self-attention mechanism to achieve precise alignment and fusion of key features between images and lidar point clouds. It fully exploits the complementary information in the two sensing modalities and improves the representational ability of the fused features, thereby improving the accuracy of terrain accessibility judgment and the robustness to environmental variation in key areas such as obstacle edges and regions with complex elevation differences.

Description of the Drawings

[0040] To more clearly illustrate the technical solution of the present invention, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0041] Figure 1 is a schematic diagram of a terrain accessibility assessment method for unmanned ground vehicles in unstructured environments, according to an embodiment of the present invention.

[0042] Figure 2 is an overall framework diagram of a terrain accessibility assessment method for unmanned ground vehicles in unstructured environments, according to an embodiment of the present invention.

[0043] Figure 3 is a framework diagram of the feature fusion stage in an embodiment of the present invention.

[0044] Figure 4 is a schematic diagram of the framework for obtaining an accessibility map from a multimodal feature map according to an embodiment of the present invention.

Detailed Implementation

[0045] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0046] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0047] Example 1:

[0048] As shown in Figure 1 and Figure 2, this embodiment provides a terrain accessibility assessment method for unmanned ground vehicles in unstructured environments, including the following steps:

[0049] S1. Collect image data and lidar point cloud data based on data acquisition equipment, and use a pre-trained model to extract features from the image data and lidar point cloud data to obtain image feature maps and point cloud feature maps; the data acquisition equipment includes: camera and lidar.

[0050] The input data in this embodiment comes from a camera and a LiDAR installed on the vehicle, forming the basis for subsequent multimodal feature construction and map generation. The camera should be mounted so that its field of view covers the vehicle's direction of travel, and its imaging range should include the road surface ahead and the space above ground where obstacles may exist, so as to fully acquire environmental information. The LiDAR is a rotating unit mounted horizontally at the highest point of the vehicle body, ensuring 360-degree field-of-view coverage and improving the completeness and accuracy of obstacle detection. To meet the requirements of 3D structure perception, the LiDAR should have no fewer than 32 beams (lines), thereby improving the completeness and resolution of the point cloud data.

[0051] The collected image data and LiDAR point cloud data are then input into pre-trained models that have undergone large-scale semantic learning to extract semantic and spatial features. Both the image feature maps and the LiDAR point cloud feature maps are three-dimensional tensors containing spatial and channel dimension information. The selected pre-trained models are based on a self-supervised learning strategy, are trained on large-scale, heterogeneous datasets, and possess good cross-domain generalization ability and rich semantic representation capabilities. For example, the DINOv2 model can be used to extract image features; this model obtains robust and structured visual semantic information through knowledge distillation and self-attention mechanisms. For LiDAR point cloud feature extraction, open-source models such as OpenShape can be used, which achieve effective encoding of high-dimensional spatial structure information through joint alignment learning on large-scale paired point cloud and text data. These pre-trained models can provide high-quality abstract semantic representations for subsequent multimodal fusion and downstream tasks.

[0052] S2. Perform feature fusion on the image feature map and the point cloud feature map to obtain a multimodal feature map.

[0053] Considering the complementarity of feature maps from different modalities derived from images and point clouds, a self-attention mechanism is employed in the feature fusion stage to enhance the fusion effect. Specifically, S2 includes the following steps:

[0054] S21. Perform convolution operations on the image feature map $F_{img}$ and the point cloud feature map $F_{pc}$ to unify their spatial resolution, resulting in a first three-dimensional feature tensor and a second three-dimensional feature tensor with the same width and height.

[0055] $F'_{img} = \mathrm{Conv}(F_{img}), \quad F'_{pc} = \mathrm{Conv}(F_{pc}), \quad F'_{img}, F'_{pc} \in \mathbb{R}^{H \times W \times C}$ (1)

[0056] In the formula, Conv represents the convolution operation; $H$, $W$, and $C$ represent the height, width, and number of channels of the feature maps $F'_{img}$ and $F'_{pc}$.

[0057] S22. The first three-dimensional feature tensor and the second three-dimensional feature tensor are concatenated to obtain a fused feature map containing multimodal information.

[0058] $F_{fused} = \mathrm{Concat}(F'_{img}, F'_{pc})$ (2)

[0059] In the formula, Concat represents concatenation along the channel dimension.

[0060] S23. A feature fusion network based on a self-attention mechanism is introduced to perform deep processing on the fused feature map, capturing long-distance dependencies and contextual associations between features of different modalities. This improves information complementarity and yields a multimodal feature map with stronger semantic expressive power, as shown in Figure 3.

[0061] $Q = F_{fused} W_Q, \quad K = F_{fused} W_K, \quad V = F_{fused} W_V$ (3)

[0062] $F_{SA} = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V$ (4)

[0063] In the formula, $Q$, $K$, and $V$ denote the sets of query, key, and value vectors obtained after flattening the fused feature map $F_{fused}$; $W_Q$, $W_K$, and $W_V$ are learnable linear projection matrices; $d$ represents the attention dimension; $H$ and $W$ represent the height and width of the fused feature map $F_{fused}$; $F_{SA}$ represents the multimodal feature map.
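Steps S22 and S23 can be sketched in a few lines of plain Python. This is a minimal illustration and not the patented implementation: it assumes single-head scaled dot-product attention, feature maps stored as nested lists of shape H x W x C, and projection matrices passed in by the caller. The names `matmul`, `softmax`, and `fuse_and_attend` are hypothetical helpers introduced only for this example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_attend(f_img, f_pc, w_q, w_k, w_v):
    """f_img, f_pc: H x W x C nested lists with equal H and W (the output of S21)."""
    # S22: concatenate along the channel dimension, then flatten H x W into tokens.
    tokens = [px_img + px_pc
              for row_img, row_pc in zip(f_img, f_pc)
              for px_img, px_pc in zip(row_img, row_pc)]
    # S23: linear projections to queries, keys, and values.
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    d = len(w_q[0])                      # attention dimension
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)            # (H*W) x d attended features
```

Each output row is a weighted average of the value vectors, so every spatial position can draw on image and point cloud channels from every other position. A real system would use learned weights and a tensor library rather than nested lists.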

[0064] S3. Perform spatial transformation on the multimodal feature map to obtain the transformed multimodal feature map.

[0065] Further, S3 includes the following steps:

[0066] S31. Construct a voxel mesh in a three-dimensional world coordinate system, and based on the camera's intrinsic parameter matrix and the lidar's extrinsic parameter matrix, project the three-dimensional center coordinates of each voxel onto the image plane to obtain pixel coordinates.

[0067] The fused feature map processed by the self-attention mechanism is transformed from pixel space to bird's-eye-view (BEV) space to improve the efficiency of terrain understanding and environmental analysis. The specific process is as follows. First, a regular voxel mesh is constructed in a 3D world coordinate system. Using the camera's intrinsic parameter matrix and the LiDAR's extrinsic parameter matrix, the 3D center coordinates of each voxel are projected onto the image plane to obtain the corresponding pixel coordinates. The relevant projection relationships are as follows:

[0068]

[0069] In the formula, K represents the camera intrinsic parameter matrix; T represents the extrinsic parameter matrix from the camera to the LiDAR; (u,v,depth) represents the pixel coordinates and depth information; and (x,y,z) represents the three-dimensional coordinates of the voxel.
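The projection in S31 can be sketched as follows. This is an illustration under stated assumptions, not the patent's exact formulation: it assumes the standard pinhole camera model, a 4x4 homogeneous matrix T that maps world (LiDAR) coordinates into a camera frame whose z axis points forward, and a 3x3 intrinsic matrix K. The function names `voxel_centers` and `project_voxel` are hypothetical.

```python
def voxel_centers(origin, size, counts):
    """Yield the centers of a regular voxel grid; origin is the minimum corner."""
    ox, oy, oz = origin
    nx, ny, nz = counts
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                yield (ox + (i + 0.5) * size,
                       oy + (j + 0.5) * size,
                       oz + (k + 0.5) * size)

def project_voxel(center, K, T):
    """Project a 3D voxel center (x, y, z) to (u, v, depth).

    Returns None when the point lies behind the camera (depth <= 0).
    """
    x, y, z = center
    p = (x, y, z, 1.0)
    # Camera-frame coordinates: first three rows of T applied to the homogeneous point.
    xc, yc, zc = (sum(T[r][c] * p[c] for c in range(4)) for r in range(3))
    if zc <= 0:
        return None
    cam = (xc, yc, zc)
    # Apply the intrinsics, then perform the perspective division.
    ku, kv, kw = (sum(K[r][c] * cam[c] for c in range(3)) for r in range(3))
    return ku / kw, kv / kw, zc
```

Voxels that project outside the image or behind the camera carry no image evidence, so a caller would typically skip them or fill them with zeros before sampling.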

[0070] S32. The multimodal feature map is sampled based on pixel coordinates to obtain voxel features.

[0071] Based on the above mapping relationship, sampling is performed in the multimodal feature map to "lift" the two-dimensional pixel features to the three-dimensional voxel space, thereby realizing the transformation from the image plane to a three-dimensional representation:

[0072] $T_{voxel}(x,y,z) = S\left(F_{BEV}, \left(u(x,y,z), v(x,y,z)\right)\right)$, (7)

[0073] In the formula, $S$ represents the bilinear interpolation sampling function; $(u(x,y,z), v(x,y,z))$ represents the pixel coordinates obtained by projecting the voxel $T_{voxel}$ with coordinates $(x,y,z)$; $F_{BEV}$ represents the transformed multimodal feature map.
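The sampling function S can be illustrated with a small bilinear interpolation routine. This sketch assumes an H x W x C nested-list feature map, with u indexing columns and v indexing rows, and it returns a zero vector for coordinates outside the map. The zero padding and the name `bilinear_sample` are assumptions made for this example only.

```python
def bilinear_sample(feature_map, u, v):
    """Sample an H x W x C feature map at the fractional pixel location (u, v)."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    if not (0 <= u <= w - 1 and 0 <= v <= h - 1):
        return [0.0] * c                      # outside the image: no evidence
    u0, v0 = int(u), int(v)                   # top-left neighbour
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0                   # fractional offsets
    out = []
    for ch in range(c):
        top = (1 - du) * feature_map[v0][u0][ch] + du * feature_map[v0][u1][ch]
        bottom = (1 - du) * feature_map[v1][u0][ch] + du * feature_map[v1][u1][ch]
        out.append((1 - dv) * top + dv * bottom)
    return out
```

Calling this once per projected voxel center fills the voxel feature volume. Because the interpolation weights vary smoothly with (u, v), the lifting step stays differentiable when implemented in a deep learning framework.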

[0074] S33. Perform dimensionality reduction on the voxel features to obtain the transformed multimodal feature map.

[0075] Finally, 3D convolutional operations are used to perform structural modeling of voxel features, mining their local spatial relationships. Then, dimensionality reduction is applied to compress the 3D features into a 2D plane, completing the feature transformation to BEV space. This step lays the spatial prior foundation for subsequent ground-referenced environment understanding tasks. The transformation method is as follows:

[0076]

[0077] In the formula, Agg represents a 3D convolution operation, and $\max_z$ indicates the max-pooling operation along the z axis.
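The dimensionality reduction in S33 can be sketched as a channel-wise maximum over the vertical axis. This example covers only the pooling step and omits the learned 3D convolution (Agg) that precedes it. It assumes voxel features indexed as `voxels[x][y][z]`, each holding a list of channels, and the name `collapse_to_bev` is hypothetical.

```python
def collapse_to_bev(voxels):
    """Reduce an X x Y x Z x C voxel volume to an X x Y x C BEV map.

    For every ground cell (x, y), each channel keeps its maximum value
    over all height slices z.
    """
    return [[[max(channel) for channel in zip(*column)]
             for column in plane]
            for plane in voxels]
```

Max pooling keeps the strongest response in each vertical column, so a tall obstacle and a low one both remain visible in the resulting 2D map regardless of the height at which the evidence was observed.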

[0078] S4. Classify and calculate the transformed multimodal feature map to obtain an elevation map and an accessibility map. Use the accessibility map as the output of the final terrain accessibility assessment.

[0079] Further, as shown in Figure 4, the method for obtaining the elevation map and the accessibility map includes: inputting the transformed multimodal feature map into a U-shaped network for processing to obtain a U-shaped feature map; and inputting the U-shaped feature map into a dual-branch network to obtain the elevation map and the accessibility map.

[0080] Specifically, the two-dimensional feature map (the transformed multimodal feature map) constructed in the BEV space is further processed through a U-Net structure to obtain a U-shaped feature map; this enables comprehensive perception and modeling of multi-scale semantic information and spatial details in the terrain environment. The U-Net architecture consists of three parts: a downsampling layer, a residual enhancement module, and an upsampling layer. The downsampling layer gradually reduces the spatial resolution of the feature map through successive convolution and pooling operations, extracting higher-level abstract features at the semantic level. The residual enhancement module strengthens gradient flow and improves feature propagation and generalization capabilities through stacked convolutions and skip connections. The upsampling layer gradually restores the spatial resolution through bilinear interpolation or deconvolution operations, and introduces low-level detail features from the symmetric downsampling layer, thereby achieving layer-by-layer restoration and multi-scale fusion of spatial information. This network structure can effectively model terrain features at different scales in the environment, providing stable feature support for elevation estimation and drivability analysis.
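The resolution bookkeeping of a one-level U-shaped pass can be sketched on a single-channel grid. This is a toy illustration rather than the network described here: the learned convolutions are replaced by fixed stand-ins, nearest-neighbour upsampling stands in for bilinear interpolation or deconvolution, and an even height and width are assumed. The names `max_pool2x2`, `upsample2x`, `u_shaped_pass`, and the `block` argument are hypothetical.

```python
def max_pool2x2(grid):
    """Halve the resolution of a 2D grid (even height and width assumed)."""
    return [[max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, len(grid[0]), 2)]
            for r in range(0, len(grid), 2)]

def upsample2x(grid):
    """Double the resolution by repeating every cell (nearest neighbour)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def u_shaped_pass(grid, block=lambda v: 0.5 * v):
    """One encoder level, one residual step, one decoder level."""
    skip = grid                                            # detail kept for the decoder
    low = max_pool2x2(grid)                                # downsampling layer
    low = [[v + block(v) for v in row] for row in low]     # residual: x + f(x)
    up = upsample2x(low)                                   # upsampling layer
    # Concatenate encoder detail and decoder context as two channels per cell.
    return [[[s, u] for s, u in zip(s_row, u_row)]
            for s_row, u_row in zip(skip, up)]
```

The output has the same height and width as the input, with one channel carrying fine detail from the encoder side and one carrying coarse context from the bottleneck. This pairing is the multi-scale fusion that the skip connections provide.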

[0081] This embodiment ultimately outputs two types of environmental understanding results: an elevation map and an accessibility map, which correspond to the modeling and evaluation of terrain geometry and accessibility conditions, respectively. At the same time, a corresponding confidence map is generated for each to indicate the reliability of the prediction.

[0082] Further, the U-shaped feature map is input into a dual-branch network, which includes a first independent branch and a second independent branch. The first independent branch generates an elevation map and an elevation confidence map. The elevation map provides detailed height information for each ground grid cell, and the elevation confidence map measures the confidence level and uncertainty of the prediction results in a specific area. The second independent branch focuses on accessibility assessment, generating an accessibility map and an accessibility confidence map. The accessibility map indicates whether the corresponding area is passable (e.g., by driving or walking), and the accessibility confidence map helps improve the robustness of the system and the reliability of task-driven decisions. This output structure can be easily integrated into autonomous driving systems, high-precision map construction, or outdoor robot path planning modules, providing rich data support for environmental cognition and navigation.
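One way a downstream planner might combine the accessibility map with its confidence map is sketched below. The text does not specify how the two outputs are consumed, so the three-way labelling, the threshold values, and the name `gate_accessibility` are all assumptions made for illustration.

```python
def gate_accessibility(accessibility, confidence, min_access=0.5, min_conf=0.7):
    """Label each BEV cell as 'free', 'blocked', or 'unknown'.

    accessibility, confidence: 2D grids of scores in [0, 1] with equal shape.
    Cells whose confidence is below min_conf are marked 'unknown' so that a
    planner can treat them conservatively instead of trusting the score.
    """
    labels = []
    for acc_row, conf_row in zip(accessibility, confidence):
        row = []
        for acc, conf in zip(acc_row, conf_row):
            if conf < min_conf:
                row.append("unknown")
            elif acc >= min_access:
                row.append("free")
            else:
                row.append("blocked")
        labels.append(row)
    return labels
```

Separating "unknown" from "blocked" lets a planner route around low-confidence regions or schedule another observation, rather than treating an uncertain prediction as a hard obstacle.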

[0083] Example 2:

[0084] This embodiment provides a terrain accessibility assessment system suitable for unmanned ground vehicles in unstructured environments, including: a data processing system, a fusion processing system, a spatial transformation system, and an assessment system. The data processing system is used to extract features from image data and LiDAR point cloud data collected by data acquisition devices using a pre-trained model, resulting in image feature maps and point cloud feature maps. The data acquisition devices include a camera and a LiDAR. The fusion processing system is used to fuse the image feature maps and point cloud feature maps to obtain multimodal feature maps. The spatial transformation system is used to perform spatial transformation on the multimodal feature maps to obtain transformed multimodal feature maps. The assessment system is used to classify and calculate the transformed multimodal feature maps to obtain an elevation map and an accessibility map, with the accessibility map serving as the final output of the terrain accessibility assessment.

[0085] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made by those skilled in the art to the technical solutions of the present invention without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims

1. A terrain accessibility assessment method for unmanned ground vehicles in unstructured environments, characterized in that it includes the following steps: S1. Collect image data and lidar point cloud data based on data acquisition equipment, and use a pre-trained model to extract features from the image data and lidar point cloud data to obtain image feature maps and point cloud feature maps; the data acquisition equipment includes a camera and a lidar; S2. Perform feature fusion on the image feature map and the point cloud feature map to obtain a multimodal feature map; S3. Perform spatial transformation on the multimodal feature map to obtain the transformed multimodal feature map; specifically: S31. Construct a voxel mesh in a three-dimensional world coordinate system, and based on the camera's intrinsic parameter matrix and the lidar's extrinsic parameter matrix, project the three-dimensional center coordinates of each voxel onto the image plane to obtain pixel coordinates; S32. Sample the multimodal feature map based on the pixel coordinates to obtain voxel features; S33. Perform dimensionality reduction processing on the voxel features to obtain the transformed multimodal feature map; S4. Classify and calculate the transformed multimodal feature map to obtain an elevation map and an accessibility map, using the accessibility map as the output of the final terrain accessibility assessment; wherein the method for obtaining the elevation map and the accessibility map includes: inputting the transformed multimodal feature map into a U-shaped network for processing to obtain a U-shaped feature map; inputting the U-shaped feature map into a two-branch network to obtain the elevation map and the accessibility map; the two-branch network includes a first independent branch and a second independent branch; the first independent branch is used to generate the elevation map and the elevation confidence map; the elevation confidence map is used to measure the confidence level of the prediction result in a specific area; the second independent branch is used to generate the accessibility map and the accessibility confidence map; the accessibility confidence map is used to help improve the reliability of decision-making.

2. The terrain accessibility assessment method for unmanned ground vehicles in unstructured environments according to claim 1, characterized in that S2 includes the following steps: S21. Perform a convolution operation on the image feature map and the point cloud feature map to obtain a first three-dimensional feature tensor and a second three-dimensional feature tensor; S22. The first three-dimensional feature tensor and the second three-dimensional feature tensor are concatenated to obtain a fused feature map; S23. Perform deep processing on the fused feature map to obtain the multimodal feature map.

3. The terrain accessibility assessment method for unmanned ground vehicles in unstructured environments according to claim 2, characterized in that the method for performing deep processing on the fused feature map to obtain the multimodal feature map includes: $Q = F_{fused} W_Q$, $K = F_{fused} W_K$, $V = F_{fused} W_V$, $F_{SA} = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V$; in the formula, $Q$, $K$, and $V$ denote the sets of query, key, and value vectors obtained after flattening the fused feature map $F_{fused}$; $W_Q$, $W_K$, and $W_V$ are learnable linear projection matrices; $d$ indicates the attention dimension; $H$ and $W$ represent the height and width of the fused feature map $F_{fused}$; $F_{SA}$ represents the multimodal feature map.

4. The terrain accessibility assessment method for unmanned ground vehicles in unstructured environments according to claim 1, characterized in that the U-shaped network includes a downsampling layer, a residual enhancement module, and an upsampling layer; the downsampling layer reduces the spatial resolution of the transformed multimodal feature map through successive convolution and pooling operations; the residual enhancement module uses stacked convolutions and shortcut (skip) connections; the upsampling layer restores the spatial resolution of the transformed multimodal feature map through bilinear interpolation or deconvolution operations to obtain the U-shaped feature map.

5. A terrain accessibility assessment system for unmanned ground vehicles in unstructured environments, the system being used to implement the method as described in any one of claims 1-4, characterized in that it includes: a data processing system, a fusion processing system, a spatial transformation system, and an evaluation system; the data processing system is used to extract features from the image data and lidar point cloud data acquired by the data acquisition device using a pre-trained model, so as to obtain image feature maps and point cloud feature maps; the data acquisition equipment includes a camera and a lidar; the fusion processing system is used to fuse the image feature map and the point cloud feature map to obtain a multimodal feature map; the spatial transformation system is used to perform spatial transformation on the multimodal feature map to obtain the transformed multimodal feature map; the evaluation system is used to classify and calculate the transformed multimodal feature map to obtain an elevation map and an accessibility map, and the accessibility map is used as the output result of the final terrain accessibility evaluation.