Mountain Environment Perception System and Method Based on Multi-Sensor Fusion
By introducing parallel data quality assessment and cross-modal inconsistency measurement methods, and dynamically adjusting the sensor feature fusion weights, the problem of perception failure caused by sudden changes in sensor data quality in mountainous environments is solved, and stable and reliable environmental perception is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JINHUA POWER TRANSMISSION & DISTRIBUTION ENG
- Filing Date
- 2026-01-28
- Publication Date
- 2026-06-30
AI Technical Summary
Existing multi-sensor fusion sensing methods cannot assess and respond to sudden changes in sensor data quality in real time in mountainous environments, leading to reduced reliability or even failure of the sensing system.
A parallel real-time data quality assessment module is introduced. By independently analyzing the internal data characteristics of each sensor and cross-modal inconsistency measures, the sensor confidence score is evaluated. Based on confidence modulation, adaptive feature fusion is performed to dynamically adjust the feature contribution weight, suppress unreliable information, and focus on adopting high-quality information.
Ensuring stable and reliable perception results under extreme challenges enhances the safety and robustness of unmanned systems in mountainous environments.
Figure CN121600412B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing, and more specifically, to a mountain environment perception system and method based on multi-sensor fusion.
Background Technology
[0002] With the development of automation technology, unmanned systems equipped with autonomous perception and decision-making capabilities, such as autonomous vehicles, exploration robots, and rescue drones, are increasingly being applied in unstructured and complex environments. Among these, mountainous environments, due to their rugged terrain, variable road conditions, lack of artificial markers, and harsh and unpredictable weather conditions, place extremely high demands on the environmental perception capabilities of unmanned systems. To ensure the safe and efficient operation of unmanned systems in mountainous areas, they must be able to understand the surrounding three-dimensional spatial structure in real time and accurately, identify potential obstacles (such as rolling stones, gullies, and fallen trees), and assess the drivability of the terrain.
[0003] Single-type sensors often fall short in addressing such complex challenges. For example, cameras provide rich color and texture information, aiding in object classification, but they are extremely sensitive to changes in lighting and adverse weather conditions (such as dense fog, heavy rain, and darkness), and struggle to directly acquire accurate 3D depth information. In contrast, lidar can directly output high-precision 3D point cloud data, accurately depicting the geometric contours of the environment, and is less affected by lighting changes. However, its performance significantly degrades in rain, snow, and fog due to laser beam scattering and attenuation, and the generated point cloud data is typically sparse and lacks texture information. Therefore, fusing data from multiple sensors, such as cameras and lidar, to achieve complementary advantages and build an all-weather, highly robust environmental perception system has become a recognized technological development direction in this field.
[0004] However, existing multi-sensor fusion methods still face significant technical bottlenecks when applied to harsh mountainous environments. Current mainstream fusion frameworks, whether early-fusion schemes that fuse raw data at the input or deep-fusion schemes that interact at the intermediate feature extraction layers, often implicitly assume that every sensor provides usable, if noisy, information at any given time. The fusion strategies of these models (e.g., fusion weights learned through attention mechanisms) tend to be fixed after training or can only undergo limited adaptive adjustments based on the data content. Such a static or slowly adapting fusion mechanism cannot effectively cope with the catastrophic changes in sensor data quality that may occur in mountainous environments. For example, when a vehicle exits a tunnel and encounters strong sunlight glare, the camera's image data may instantly become heavily overexposed, so that its information goes from low quality to highly misleading; or, in a sudden localized dense fog, the point cloud from a lidar may change from a faithful reflection of the environment to a false wall of points. In such cases, existing fusion models, lacking a mechanism to judge the real-time reliability of input data, still blindly include this erroneous and harmful information in the fusion calculation. Modules such as the attention mechanism are deceived, the final fused features are seriously polluted, and incorrect perception results are output, posing a serious threat to the safety of the system.
[0005] Therefore, an optimized mountain environment perception scheme based on multi-sensor fusion is desired.
Summary of the Invention
[0006] To address the aforementioned technical problems, this application is proposed. Embodiments of this application provide a mountain environment perception system and method based on multi-sensor fusion.
[0007] According to one aspect of this application, a method for mountain environment perception based on multi-sensor fusion is provided, comprising:
[0008] acquiring raw image data and raw lidar data of a target mountain environment;
[0009] performing parallel real-time data quality assessment on the raw image data and the raw lidar data to obtain a first sensor confidence score and a second sensor confidence score;
[0010] performing multimodal feature extraction on the raw image data and the raw lidar data to obtain an image feature map and a point cloud feature map;
[0011] performing, based on the first sensor confidence score and the second sensor confidence score, confidence-modulated adaptive feature fusion on the image feature map and the point cloud feature map to obtain a mountain environment multimodal fusion feature map; and
[0012] inputting the mountain environment multimodal fusion feature map into an environment perception task head to obtain an environment perception result.
[0013] According to another aspect of this application, a mountain environment perception system based on multi-sensor fusion is provided, comprising:
[0014] a raw data acquisition module, used to acquire raw image data and raw lidar data of a target mountain environment;
[0015] a real-time data quality assessment module, used to perform parallel real-time data quality assessment on the raw image data and the raw lidar data to obtain a first sensor confidence score and a second sensor confidence score;
[0016] a multimodal feature extraction module, used to perform multimodal feature extraction on the raw image data and the raw lidar data to obtain an image feature map and a point cloud feature map;
[0017] a mountain environment multimodal feature fusion module, used to perform, based on the first sensor confidence score and the second sensor confidence score, confidence-modulated adaptive feature fusion on the image feature map and the point cloud feature map to obtain a mountain environment multimodal fusion feature map; and
[0018] an environment perception module, used to input the mountain environment multimodal fusion feature map into an environment perception task head to obtain an environment perception result.
[0019] Compared with existing technologies, this application provides a mountain environment perception system and method based on multi-sensor fusion. To address the problem of perception failure when sensor data quality changes abruptly, this application introduces a parallel real-time data quality assessment module that acts as a pre-judgment step before feature fusion. This module not only analyzes the internal data characteristics of each sensor independently to assess its inherent quality, but also introduces a cross-modal inconsistency measure: by comparing how different sensors understand the geometry of the same scene, it determines whether their data conflict. Based on the dynamic confidence scores obtained from this assessment, a confidence-modulated adaptive feature fusion mechanism is further constructed. This mechanism uses confidence as a key modulation factor to dynamically adjust the contribution weight of each sensor's features during fusion. When the information provided by any sensor is unreliable, its influence is actively suppressed and information sources with high credibility are prioritized. This allows the system to output stable and reliable perception results even under extreme challenges such as glare and dense fog, improving the safety and robustness of unmanned systems in mountainous environments.
Attached Figure Description
[0020] The above and other objects, features, and advantages of this application will become more apparent from the more detailed description of the embodiments of this application in conjunction with the accompanying drawings. The drawings are provided to further illustrate the embodiments of this application and form part of the specification. They are used together with the embodiments of this application to explain this application and do not constitute a limitation thereof. In the drawings, the same reference numerals generally represent the same components or steps.
[0021] Figure 1 is a flowchart of a mountain environment perception method based on multi-sensor fusion according to an embodiment of this application;
[0022] Figure 2 is a schematic diagram of data flow in a mountain environment perception method based on multi-sensor fusion according to an embodiment of this application;
[0023] Figure 3 is a flowchart of performing parallel real-time data quality assessment on the raw image data and the raw lidar data to obtain a first sensor confidence score and a second sensor confidence score, in the mountain environment perception method based on multi-sensor fusion according to an embodiment of this application;
[0024] Figure 4 is a flowchart of performing a cross-modal prediction-based inconsistency measurement on the raw image data and the raw lidar data to obtain a cross-modal inconsistency score, in the mountain environment perception method based on multi-sensor fusion according to an embodiment of this application;
[0025] Figure 5 is a flowchart of performing, based on the first sensor confidence score and the second sensor confidence score, confidence-modulated adaptive feature fusion on the image feature map and the point cloud feature map to obtain a mountain environment multimodal fusion feature map, in the mountain environment perception method based on multi-sensor fusion according to an embodiment of this application;
[0026] Figure 6 is a block diagram of a mountain environment perception system based on multi-sensor fusion according to an embodiment of this application.
Detailed Implementation
[0027] Hereinafter, exemplary embodiments according to this application will be described in detail with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this application, and not all embodiments of this application. It should be understood that this application is not limited to the exemplary embodiments described herein.
[0028] As indicated in this application and the claims, unless the context clearly indicates otherwise, the words "a," "an," and / or "the" do not refer specifically to the singular and may include the plural. Generally speaking, the terms "comprising" and "including" only indicate the inclusion of explicitly identified steps and elements, which do not constitute an exclusive list, and the method or apparatus may also include other steps or elements.
[0029] While this application makes various references to certain modules of the systems according to embodiments of this application, any number of different modules can be used and run on user terminals and / or servers. The modules described are merely illustrative, and different aspects of the systems and methods may use different modules.
[0030] Flowcharts are used in this application to illustrate the operations performed by the system according to embodiments of this application. It should be understood that the preceding or following operations are not necessarily performed in exact order. Instead, various steps can be processed in reverse order or simultaneously as needed. Furthermore, other operations can be added to these processes, or one or more steps can be removed from them.
[0031] Hereinafter, exemplary embodiments according to this application will be described in detail with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this application, and not all embodiments of this application. It should be understood that this application is not limited to the exemplary embodiments described herein.
[0032] Existing perception systems become less reliable, or fail outright, in complex environments such as mountainous terrain because they cannot assess and respond in real time to sudden changes in sensor data quality (such as glare on the camera or dense fog affecting the lidar). To address this technical problem, this application proposes a multi-sensor fusion-based method for mountain environment perception. The method first performs parallel real-time data quality assessment on the acquired raw image data and raw lidar data. This assessment takes a two-pronged approach: on the one hand, it extracts a single-modality quality feature vector for each sensor using a lightweight network to determine that sensor's inherent data quality; on the other hand, it introduces an inconsistency measure based on cross-modal prediction, such as comparing the depth map predicted from the image with the measured depth map from the lidar, to quantify the degree of conflict between the data sources. Combining these two aspects, the system calculates a first sensor confidence score for the camera and a second sensor confidence score for the lidar. Subsequently, after multimodal feature extraction, the method performs adaptive feature fusion based on confidence modulation. Instead of fusing all features statically or blindly, it uses the aforementioned confidence scores as dynamic weights to modulate the intensity of feature interactions in real time within a cross-modal attention mechanism. When the confidence score of a sensor decreases, the influence of its features in the fusion process is significantly weakened, so that the system disregards unreliable information and relies on the high-quality data sources. Ultimately, by feeding the reliably fused multimodal feature map of the mountain environment into the task head, the method yields stable and accurate environment perception results even in extreme cases where the performance of some sensors degrades drastically, thereby improving the robustness and safety of the system.
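The confidence-modulated fusion idea can be illustrated with a minimal sketch. It is not the fusion network of this application: the function name `confidence_modulated_fusion`, the single-head attention without learned projections, and the choice to gate each branch by its confidence normalized over both sensors are all assumptions made for illustration, written in plain Python.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def confidence_modulated_fusion(img_tokens, lidar_tokens, conf_img, conf_lidar):
    """Fuse image tokens with lidar tokens through cross-attention.

    Assumption: each branch is gated by its confidence divided by the sum of
    both confidences, so a sensor with confidence 0 contributes nothing.
    """
    total = conf_img + conf_lidar
    if total <= 0.0:
        raise ValueError("at least one sensor needs a positive confidence")
    gate_img, gate_lidar = conf_img / total, conf_lidar / total
    dim = len(img_tokens[0])
    fused = []
    for query in img_tokens:
        # Image token attends over all lidar tokens (scaled dot-product).
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in lidar_tokens])
        attended = [
            sum(w * token[d] for w, token in zip(weights, lidar_tokens))
            for d in range(dim)
        ]
        # Confidence gates decide how much each modality contributes.
        fused.append([gate_img * query[d] + gate_lidar * attended[d] for d in range(dim)])
    return fused
```

With `conf_lidar` set to 0 the output equals the image tokens, which mirrors the behavior described above for a lidar blinded by fog; with `conf_img` set to 0 only the attended lidar features remain.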
[0033] Figure 1 is a flowchart of a mountain environment perception method based on multi-sensor fusion according to an embodiment of this application. Figure 2 is a schematic diagram of data flow in a mountain environment perception method based on multi-sensor fusion according to an embodiment of this application. As shown in Figures 1 and 2, the mountain environment perception method based on multi-sensor fusion according to an embodiment of this application includes the following steps: S100, acquiring raw image data and raw lidar data of the target mountain environment; S200, performing parallel real-time data quality assessment on the raw image data and raw lidar data to obtain a first sensor confidence score and a second sensor confidence score; S300, performing multimodal feature extraction on the raw image data and raw lidar data to obtain an image feature map and a point cloud feature map; S400, performing adaptive feature fusion based on confidence modulation on the image feature map and the point cloud feature map based on the first sensor confidence score and the second sensor confidence score to obtain a multimodal fusion feature map of the mountain environment; and S500, inputting the multimodal fusion feature map of the mountain environment into the environment perception task head to obtain the environment perception result.
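The data flow of steps S100 to S500 can be sketched as a small orchestration function. The function `perceive` and its four callable parameters are hypothetical placeholders for the modules named in the steps, and running S200 alongside S300 in a thread pool is an assumption of this sketch (both depend only on the raw data), not something the steps themselves require.

```python
from concurrent.futures import ThreadPoolExecutor

def perceive(image, cloud, assess_quality, extract_features, fuse, task_head):
    """Run one perception cycle on a synchronized image / point cloud pair (S100 output)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        quality_job = pool.submit(assess_quality, image, cloud)    # S200
        feature_job = pool.submit(extract_features, image, cloud)  # S300
        conf_img, conf_lidar = quality_job.result()
        img_feat, pc_feat = feature_job.result()
    fused = fuse(img_feat, pc_feat, conf_img, conf_lidar)          # S400
    return task_head(fused)                                        # S500
```

Passing the stages in as callables keeps the sketch independent of any particular network architecture; each stage can be replaced by a stub when testing the wiring.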
[0034] Specifically, in step S100, raw image data and raw LiDAR data of the target mountainous environment are acquired. It should be understood that single sensors have inherent limitations in dealing with the complexity and variability of mountainous environments. For example, image sensors are susceptible to lighting and weather conditions and lack precise depth information, while LiDAR sensors, although providing precise geometric structures, struggle to distinguish object materials and experience performance degradation in adverse weather conditions. Therefore, in the technical solution of this application, raw image data and raw LiDAR data of the target mountainous environment are acquired to provide complementary multimodal raw data input for subsequent environmental perception tasks. This ensures that the system possesses dual information sources from the outset: two-dimensional visual data containing rich texture and color information, and high-precision three-dimensional spatial geometric data. This lays a solid data foundation for overcoming the perception blind spots and failure risks of single sensors in specific scenarios through fusion analysis.
[0035] More specifically, in a concrete example of this application, the process of acquiring raw image data and raw LiDAR data of the target mountainous environment is achieved through an integrated vehicle-mounted multi-sensor system. First, a forward-looking high-resolution digital camera and a multi-beam LiDAR sensor are rigidly mounted and integrated on an autonomous vehicle or mobile robot platform. Second, precise time synchronization is performed between the camera and LiDAR to ensure that each frame of data acquired by the two sensors corresponds to the environmental state at the same moment. This is accomplished through a unified hardware trigger signal or a software synchronization mechanism based on a network time protocol. Third, during vehicle operation, the camera continuously captures two-dimensional color images of the mountainous scene ahead at a preset frame rate, forming a raw image data stream; simultaneously, the LiDAR continuously emits laser pulses and receives echoes through rotating scanning, generating a dense three-dimensional point cloud containing the three-dimensional coordinates and reflection intensity information of each point, forming a raw LiDAR data stream. Fourth, the acquired raw image data and raw LiDAR data are transmitted in real time to the onboard computing unit as the initial input for all subsequent perception algorithm processing.
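For the software synchronization case, one simple way to pair camera frames with lidar sweeps is nearest-timestamp matching. The sketch below assumes both devices stamp their data against a common clock and that timestamps are sorted in ascending order; the function name `pair_frames` and the 20 ms tolerance are illustrative choices, not values from this application.

```python
from bisect import bisect_left

def pair_frames(camera_stamps, lidar_stamps, max_offset_s=0.02):
    """Return (camera_index, lidar_index) pairs whose timestamps differ by at most max_offset_s.

    Both inputs are lists of timestamps in seconds, sorted ascending.
    Camera frames without a close enough lidar sweep are dropped.
    """
    pairs = []
    for cam_idx, stamp in enumerate(camera_stamps):
        pos = bisect_left(lidar_stamps, stamp)
        # Only the sweeps just before and just after the frame can be nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(lidar_stamps)]
        if not candidates:
            continue
        best = min(candidates, key=lambda i: abs(lidar_stamps[i] - stamp))
        if abs(lidar_stamps[best] - stamp) <= max_offset_s:
            pairs.append((cam_idx, best))
    return pairs
```

With a hardware trigger this matching step becomes unnecessary, because both sensors are exposed on the same signal.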
[0036] Specifically, in step S200, parallel real-time data quality assessment is performed on the raw image data and the raw LiDAR data to obtain a first sensor confidence score and a second sensor confidence score. It should be understood that existing multi-sensor fusion methods lack an online mechanism for evaluating the real-time reliability of input data. When sensor data quality drops suddenly because of abrupt lighting changes or severe weather in mountainous environments, misleading information may therefore be introduced into the fusion process, severely affecting the accuracy of the perception results and the safety of the system. Therefore, in the technical solution of this application, parallel real-time data quality assessment is further performed on the raw image data and the raw LiDAR data to obtain a first sensor confidence score and a second sensor confidence score. This provides the subsequent adaptive feature fusion process with a quantitative, dynamically updated signal that characterizes the current reliability of the information from each sensor. The perception system thus knows the reliability of each data source before feature fusion, which gives it a decision basis for suppressing or isolating the harmful effects of failed sensors and for prioritizing high-quality information, thereby ensuring the robustness of the fusion results.
[0037] Figure 3 is a flowchart of performing parallel real-time data quality assessment on the raw image data and the raw LiDAR data to obtain a first sensor confidence score and a second sensor confidence score, in the mountain environment perception method based on multi-sensor fusion according to an embodiment of this application. As shown in Figure 3, step S200 includes: S210, extracting a camera modality quality feature vector from the raw image data; S220, extracting a lidar modality quality feature vector from the raw lidar data; S230, performing a cross-modal prediction-based inconsistency measurement on the raw image data and the raw lidar data to obtain a cross-modal inconsistency score; and S240, calculating the first sensor confidence score and the second sensor confidence score based on the cross-modal inconsistency score, the camera modality quality feature vector, and the lidar modality quality feature vector.
[0038] Accordingly, in steps S210 and S220, camera modal quality feature vectors are extracted from the original image data, and lidar modal quality feature vectors are extracted from the original lidar data. It should be understood that relying solely on external comparisons between different sensor data is insufficient to comprehensively assess data quality, as the intrinsic state of the sensor itself (such as lens blur and sensor noise) is also a key factor determining the reliability of its information, and such intrinsic quality issues may not immediately manifest in cross-modal comparisons. Therefore, in the technical solution of this application, camera modal quality feature vectors are further extracted from the original image data, and lidar modal quality feature vectors are extracted from the original lidar data. This allows for independent and parallel intrinsic quality checks on the data streams of each sensor, quantifying them into structured feature vectors. This provides direct evidence about the sensor's own operating state for subsequent confidence calculations, complementing the cross-modal inconsistency scores, thereby constructing a more comprehensive and robust quality assessment system.
[0039] Specifically, in this embodiment, extracting camera modality quality feature vectors from raw image data includes: inputting the raw image data into a lightweight convolutional neural network model to obtain camera modality quality feature vectors; extracting lidar modality quality feature vectors from raw lidar data includes: inputting the raw lidar data into a point cloud quality feature extractor based on a sparse convolutional architecture to obtain the lidar modality quality feature vectors. More specifically, for camera modality, the system inputs the acquired raw image data into a pre-trained lightweight convolutional neural network model. This model is designed to analyze the low-level statistical characteristics of images. It extracts and quantifies the brightness distribution of the image through convolutional layers to determine whether there is overexposure or underexposure, analyzes the gradient information of the image to evaluate its sharpness and whether there is motion blur, and identifies specific visual artifacts caused by weather conditions such as rain, snow, and fog. Finally, it integrates these analysis results into a fixed-dimensional camera modality quality feature vector through a global pooling layer. Secondly, for lidar modality, the system inputs the acquired raw lidar data into a point cloud quality feature extractor based on a sparse convolutional architecture. The extractor first converts the sparse point cloud data into voxels, and then efficiently processes these voxels through a sparse 3D convolutional network to analyze the density distribution of the point cloud in space, the statistical characteristics of the laser return intensity value, and the number of isolated noise points. These indicators can effectively reflect whether the lidar is affected by environmental interference (such as dust, precipitation) or its own performance degradation. 
Finally, the extractor outputs a lidar modal quality feature vector that characterizes the intrinsic data quality of the point cloud.
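To make the quantities named above concrete, the following sketch computes hand-crafted versions of them. It is a stand-in for illustration only: the embodiment uses a learned lightweight CNN and a sparse-convolution extractor, whereas the functions `camera_quality_vector` and `lidar_quality_vector`, the exposure thresholds, the 0.5 m voxel size, and the "alone in its voxel" definition of an isolated point are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev

def camera_quality_vector(gray, low=10, high=245):
    """gray: 2D list of intensities in 0..255.

    Returns [underexposed fraction, overexposed fraction, mean gradient magnitude].
    A low mean gradient hints at blur or fog; high exposure fractions hint at glare or darkness.
    """
    rows, cols = len(gray), len(gray[0])
    pixels = [p for row in gray for p in row]
    under = sum(p <= low for p in pixels) / len(pixels)
    over = sum(p >= high for p in pixels) / len(pixels)
    grads = [abs(gray[y][x + 1] - gray[y][x]) for y in range(rows) for x in range(cols - 1)]
    grads += [abs(gray[y + 1][x] - gray[y][x]) for y in range(rows - 1) for x in range(cols)]
    return [under, over, mean(grads) if grads else 0.0]

def lidar_quality_vector(points, voxel=0.5):
    """points: list of (x, y, z, intensity).

    Returns [points per occupied voxel, intensity mean, intensity std, isolated fraction].
    """
    if not points:
        return [0.0, 0.0, 0.0, 0.0]
    keys = [(int(x // voxel), int(y // voxel), int(z // voxel)) for x, y, z, _ in points]
    occupancy = Counter(keys)
    intensities = [p[3] for p in points]
    # Crude noise proxy: a point counts as isolated if it is alone in its voxel.
    isolated = sum(1 for k in keys if occupancy[k] == 1) / len(points)
    return [len(points) / len(occupancy), mean(intensities), pstdev(intensities), isolated]
```

A fully saturated frame yields an overexposed fraction of 1 and a gradient of 0, the kind of signature the learned extractor is described as picking up after a tunnel exit.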
[0040] Accordingly, in step S230, a cross-modal prediction-based inconsistency measurement is performed on the raw image data and the raw LiDAR data to obtain a cross-modal inconsistency score. It should be understood that independent internal quality assessment of each sensor cannot, on its own, detect inconsistencies in perception results caused by calibration deviations between sensors or by a sensor being misled by specific environmental phenomena (such as water reflections or transparent glass), and such deep-seated conflicts pose a potential threat to system safety. Therefore, in the technical solution of this application, a cross-modal prediction-based inconsistency measurement is further performed on the raw image data and the raw LiDAR data to obtain a cross-modal inconsistency score, thereby establishing a cross-sensor data verification and arbitration mechanism. By quantifying how differently the data sources understand the geometric structure of the same scene, potential perception conflicts can be discovered. This provides strong external evidence for the final confidence assessment, enabling the system to identify and quantify subtle contradictions in the perception data that internal checks alone cannot detect, and thus improves the accuracy and comprehensiveness of the quality assessment.
[0041] Figure 4 is a flowchart of performing a cross-modal prediction-based inconsistency measurement on the raw image data and the raw LiDAR data to obtain a cross-modal inconsistency score, according to an embodiment of this application. As shown in Figure 4, step S230 includes: S231, inputting the raw image data into a monocular depth estimation model to obtain a predicted depth map; S232, projecting the raw lidar data onto the image coordinate system through a calibration matrix to obtain a sparse lidar depth map; and S233, performing an inconsistency measurement on the predicted depth map and the sparse lidar depth map to obtain the cross-modal inconsistency score.
[0042] Specifically, in step S231, the original image data is input into a monocular depth estimation model to obtain a predicted depth map. It should be understood that since the original image data exists in the form of color and brightness information, while the LiDAR data exists in the form of three-dimensional spatial coordinate points, the two belong to different data modalities and cannot be directly compared in terms of geometric consistency. Therefore, in the technical solution of this application, the original image data is further input into a monocular depth estimation model to obtain a predicted depth map, thereby transforming the two-dimensional visual information into a dense, pixel-level inference result of the scene's three-dimensional geometric structure. This provides an intermediate data representation that can be directly aligned and compared with the LiDAR data in the geometric dimension for subsequent cross-modal inconsistency measurement steps, making it possible to quantify the differences in scene understanding between the two sensors.
[0043] More specifically, in a concrete example of this application, the generation process of the predicted depth map is executed in real time on the vehicle-mounted computing unit. First, the system invokes a lightweight monocular depth estimation neural network model with an encoder-decoder architecture, pre-trained on a large driving scene dataset. Second, the raw image data of the current frame acquired from the vehicle-mounted camera is preprocessed, including resizing it to the input resolution required by the model and performing numerical normalization. Third, the preprocessed image tensor is fed into the depth estimation model for a forward propagation calculation. Fourth, the model outputs a single-channel two-dimensional tensor, the size of which corresponds to the input image. Each element in this tensor represents the predicted depth of the corresponding pixel in the image. This two-dimensional tensor is the required predicted depth map and is passed to the subsequent inconsistency measurement process.
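The preprocessing in the second step (resizing and numerical normalization) might look like the sketch below. The function name `preprocess`, the nearest-neighbor resampling, the channels-first output layout, and the per-channel mean and standard deviation defaults (the values commonly used with ImageNet-trained backbones) are assumptions; the actual model's required resolution and statistics are not given here.

```python
def preprocess(image, out_h, out_w,
               mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Resize and normalize an RGB image for a depth network.

    image: H x W nested list of (r, g, b) tuples with values in 0..255.
    Returns a 3 x out_h x out_w nested list (channels first).
    """
    in_h, in_w = len(image), len(image[0])
    planes = [[[0.0] * out_w for _ in range(out_h)] for _ in range(3)]
    for y in range(out_h):
        src_y = min(in_h - 1, y * in_h // out_h)  # nearest-neighbor row
        for x in range(out_w):
            src_x = min(in_w - 1, x * in_w // out_w)  # nearest-neighbor column
            for c in range(3):
                scaled = image[src_y][src_x][c] / 255.0
                planes[c][y][x] = (scaled - mean[c]) / std[c]
    return planes
```

The resulting tensor would then be passed to the depth model for the forward pass described in the third step.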
[0044] Specifically, in step S232, the original LiDAR data is projected onto the image coordinate system using a calibration matrix to obtain a sparse LiDAR depth map. It should be understood that since the original LiDAR data exists as a three-dimensional point cloud in its own coordinate system, while the predicted depth map generated in the previous step is a dense representation in a two-dimensional image coordinate system, the two are different in coordinate space and data structure, and cannot be directly compared numerically. Therefore, in the technical solution of this application, the original LiDAR data is further projected onto the image coordinate system using a calibration matrix to obtain a sparse LiDAR depth map, thereby transforming the discrete three-dimensional point cloud into a two-dimensional sparse depth representation with the same data dimension in the same coordinate system as the predicted depth map. This provides a geometric benchmark with a true scale, allowing for direct pixel-level alignment and comparison, for subsequent inconsistency measurement steps, thus making it possible to quantify the geometrically perceived differences between the two modalities.
[0045] More specifically, in a concrete example of this application, the generation process of the sparse LiDAR depth map is executed in real time on the vehicle-mounted computing unit. First, the system loads pre-calibrated offline sensor extrinsic parameters from storage. These extrinsic parameters include a calibration matrix describing a rigid transformation from the LiDAR coordinate system to the camera coordinate system. Second, the system creates a two-dimensional array with the exact same resolution as the camera image and initializes all its elements to zero or a specific value representing invalid depth. This array will be used to store the generated sparse LiDAR depth map. Third, the system iterates through each three-dimensional point in the raw LiDAR data acquired at the current moment. For each point, the system applies the calibration matrix to transform its three-dimensional coordinates to the camera's three-dimensional coordinate system and further projects it onto the two-dimensional image plane using the camera's intrinsic parameter model, thereby calculating the pixel coordinates (u, v) and its depth value Z in the camera coordinate system corresponding to the three-dimensional point. Fourth, the system checks whether the calculated pixel coordinates (u, v) are within the valid range of the image. If they are within the range, the depth value Z of the point is filled into the aforementioned two-dimensional array at the position of coordinates (u, v). After traversing all the three-dimensional points, the two-dimensional array constitutes the required sparse lidar depth map.
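The four steps above can be sketched as follows. The sketch assumes an ideal pinhole camera without lens distortion (intrinsics `fx`, `fy`, `cx`, `cy`) and a calibration given as a 3x3 rotation plus a translation vector. Two details go beyond the description and are assumptions: points behind the camera are skipped, and when several points land on the same pixel the nearest one is kept.

```python
def project_to_sparse_depth(points, rotation, translation, fx, fy, cx, cy,
                            width, height, invalid=0.0):
    """Project lidar points (x, y, z) into a height x width sparse depth map.

    rotation: 3x3 nested list, translation: length-3 list (lidar -> camera frame).
    Pixels that receive no point keep the `invalid` value.
    """
    depth = [[invalid] * width for _ in range(height)]
    for x, y, z in points:
        # Rigid transform from the lidar frame into the camera frame.
        xc, yc, zc = (
            row[0] * x + row[1] * y + row[2] * z + t
            for row, t in zip(rotation, translation)
        )
        if zc <= 0.0:
            continue  # behind the image plane (assumption: such points are dropped)
        # Pinhole projection to pixel coordinates (u, v).
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        if 0 <= u < width and 0 <= v < height:
            # Assumption: keep the nearest return when pixels collide.
            if depth[v][u] == invalid or zc < depth[v][u]:
                depth[v][u] = zc
    return depth
```

Only pixels hit by a lidar return carry a depth value, so any later comparison with the dense predicted depth map has to be restricted to those valid pixels.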
[0046] Specifically, in step S233, an inconsistency measurement is performed on the predicted depth map and the sparse LiDAR depth map to obtain the cross-modal inconsistency score. It should be understood that the existing approach, which measures inter-modal inconsistency by the mean absolute error between the predicted depth map and the sparse ground-truth depth map, has inherent flaws in scenarios with extremely high safety requirements, such as mountain environment perception. When calculating the difference, this approach fails to fully consider real-world geometric context and depth dependence. First, it is spatially insensitive and treats the depth error of all pixels equally. In autonomous driving scenarios, however, an error on a near target is more critical than the same error on a far target, and this spatial relationship of near and far is completely ignored. Second, it lacks sensitivity to local structure: it calculates the error independently point by point and ignores the geometric continuity of object surfaces, which is an important physical characteristic. It therefore cannot distinguish a prediction that has a small overall offset but a correct structure from a structurally incorrect prediction that has a small average error but a noisy surface, so the inconsistency score may contradict the actual reliability of the perception. To address these shortcomings, an optimized inconsistency measurement mechanism better suited to mountainous environments is proposed. In the technical solution of this application, the inconsistency measurement applied to the predicted depth map and the sparse LiDAR depth map constructs a composite metric that perceives absolute distance deviations, evaluates local geometric similarity, and dynamically adjusts error weights according to target distance. This produces a cross-modal inconsistency score that is more reliable than the traditional mean absolute error and better reflects the complexity of the physical world. The score amplifies the perceived risk of errors on near-range targets while tolerating benign errors whose structure is correct, thus providing a high-fidelity adjudication signal for the subsequent confidence calculation.
[0047] More specifically, in this embodiment of the application, performing the inconsistency measurement on the predicted depth map and the sparse LiDAR depth map to obtain the cross-modal inconsistency score includes: performing local gradient calculation on the predicted depth map and the sparse LiDAR depth map to obtain the gradient map of the predicted depth map and the gradient map of the sparse LiDAR depth map; calculating the absolute depth error term between the predicted depth map and the sparse LiDAR depth map; calculating the local gradient consistency error term between the gradient map of the predicted depth map and the gradient map of the sparse LiDAR depth map; and performing error aggregation on the absolute depth error term and the local gradient consistency error term, based on the depth weighting factor determined by the sparse LiDAR depth map, to obtain the cross-modal inconsistency score.
[0048] Accordingly, local gradient calculations are performed on the predicted depth map and the sparse lidar depth map to obtain the gradient maps of the predicted depth map and the sparse lidar depth map, respectively. It should be understood that since the original depth map data only provides point-by-point distance information and cannot be directly used to characterize key local geometric structures, it is difficult to effectively distinguish between predictions with slight overall offsets but correct structures, and predictions with small average errors but noisy surfaces when performing inconsistency measurements. Therefore, in the technical solution of this application, local gradient calculations are further performed on the predicted depth map and the sparse lidar depth map to obtain the gradient maps of the predicted depth map and the sparse lidar depth map, thereby extracting high-dimensional structural information reflecting local orientation and shape changes of the object's surface from the original point-by-point distance information. In this way, abstract depth values can be transformed into numerical descriptions of scene geometric features. For example, in a mountainous environment, gradient maps can transform abstract depth values into quantitative reflections of specific geometric features such as the steepness of slopes, the angularity of rocks, or the vertical shape of tree trunks. This lays the data foundation for subsequent more refined composite measurements that consider both distance error and structural consistency.
[0049] More specifically, in a concrete example of this application, the gradient map generation process is executed in parallel on the vehicle-mounted computing unit. First, the system receives the predicted depth map and the sparse LiDAR depth map generated in the preceding steps in parallel. Second, for the dense predicted depth map, the system applies the Sobel operator across its entire image domain. Specifically, by convolving the image with a horizontal Sobel convolution kernel and a vertical Sobel convolution kernel respectively, the gradient components of each pixel in the horizontal and vertical directions are obtained. These two components together constitute the two-dimensional gradient vector of that pixel, and the set of gradient vectors of all pixels is the gradient map of the predicted depth map. Third, for the sparse LiDAR depth map, the system performs the same Sobel operator convolution operation only on pixels that have valid depth values within themselves and their neighborhoods to calculate their gradient vectors, thereby generating a gradient map of the sparse LiDAR depth map that is equally sparse and aligned with the original sparse LiDAR depth map. In this way, higher-dimensional structural information can be extracted from the original data, providing two core inputs for subsequent composite metrics: the original depth information and the newly generated geometric structure information.
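The Sobel step described above can be sketched with plain NumPy as follows. The function name and the validity rule are assumptions: the text only says that, for the sparse map, the operator is applied to pixels whose own value and neighborhood are valid, which is read here as requiring the full 3x3 window to be valid. The kernels are applied as a correlation (no kernel flip), so a depth that increases to the right gives a positive horizontal gradient.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def sobel_gradients(depth, valid=None):
    """Return (gx, gy, ok) for a depth map of shape (H, W).

    valid: optional boolean mask of pixels that hold a real depth value.
    ok marks the pixels where a gradient was actually computed.
    """
    h, w = depth.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    ok = np.zeros((h, w), dtype=bool)
    if valid is None:
        valid = np.ones((h, w), dtype=bool)  # dense map: every pixel is valid
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # assumption: the whole 3x3 neighborhood must be valid
            if not valid[r - 1:r + 2, c - 1:c + 2].all():
                continue
            patch = depth[r - 1:r + 2, c - 1:c + 2]
            gx[r, c] = np.sum(SOBEL_X * patch)
            gy[r, c] = np.sum(SOBEL_Y * patch)
            ok[r, c] = True
    return gx, gy, ok
```

For the dense predicted depth map the mask can be omitted; for the sparse LiDAR depth map, passing `valid = depth != invalid_value` keeps the resulting gradient map sparse and aligned with the input.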
[0050] Accordingly, an absolute depth error term is calculated between the predicted depth map and the sparse LiDAR depth map; a local gradient consistency error term is calculated between the gradient map of the predicted depth map and the gradient map of the sparse LiDAR depth map; and, based on the depth weighting factor determined by the sparse LiDAR depth map, the absolute depth error term and the local gradient consistency error term are aggregated to obtain the cross-modal inconsistency score. It should be understood that, due to the limitations of a single error metric, neither absolute depth error nor gradient error can fully reflect the inconsistency between multimodal data, particularly the inability to differentiate the risk level of errors based on target distance, potentially leading to measurement results that contradict the perceived reliability. Therefore, in the technical solution of this application, the absolute depth error term between the predicted depth map and the sparse LiDAR depth map is further calculated, and the local gradient consistency error term between the gradient map of the predicted depth map and the gradient map of the sparse LiDAR depth map is calculated. Furthermore, based on the depth weighting factor determined by the sparse LiDAR depth map, the absolute depth error term and the local gradient consistency error term are aggregated to obtain the cross-modal inconsistency score, thereby constructing a novel composite metric that can simultaneously perceive the danger level of distance errors and geometric similarity. Specifically, this process no longer uses simple error summation, but instead calculates the inconsistency contribution of each effective pixel in the sparse LiDAR depth map using a composite loss function, and finally averages the contribution values of all points. This produces a comprehensive, robust, and highly consistent inconsistency metric that meets the real needs of physical scenarios. 
It can intelligently amplify near-range errors that pose a direct threat to driving safety, while giving higher tolerance to benign errors that only have overall offset but correct local structure, thus providing a high-fidelity decision basis for the final confidence calculation.
[0051] More specifically, in a particular example of this application, the calculation of the cross-modal inconsistency score is performed on an onboard computing unit. First, the system traverses every pixel in the sparse LiDAR depth map that has a valid depth value. For each such pixel $p$, the system calculates its absolute depth error term, which is the absolute value of the difference between the depth value at that pixel in the predicted depth map and the depth value in the sparse LiDAR depth map. It is expressed by the following formula:
[0052] $$E_{abs}(p) = \left| D_{pred}(p) - D_{lidar}(p) \right|$$
[0053] where $E_{abs}(p)$ is the absolute depth error at pixel $p$; $D_{pred}(p)$ is the depth value of the predicted depth map at $p$; and $D_{lidar}(p)$ is the depth value of the sparse LiDAR depth map at $p$. This term preserves the fundamental measure of absolute distance accuracy.
[0054] Second, the system calculates the local gradient consistency error term at that pixel. This is done by calculating the cosine similarity between the gradient vector at the pixel in the gradient map of the predicted depth map and the gradient vector in the gradient map of the sparse LiDAR depth map, and then subtracting this similarity from 1. It can be expressed by the following formula:
[0055] $$E_{grad}(p) = 1 - \frac{G_{pred}(p) \cdot G_{lidar}(p)}{\left\| G_{pred}(p) \right\| \left\| G_{lidar}(p) \right\| + \epsilon}$$
[0056] where $E_{grad}(p)$ is the gradient consistency error at pixel $p$; $G_{pred}(p)$ and $G_{lidar}(p)$ are the gradient vectors at $p$ in the gradient map of the predicted depth map and the gradient map of the sparse LiDAR depth map, respectively; $\cdot$ denotes the dot product of vectors; $\|\cdot\|$ denotes the norm of a vector; and $\epsilon$ is a tiny constant that prevents the denominator from being zero. By calculating the cosine similarity between the two gradient vectors, the local gradient consistency error term determines whether the predicted local surface shape matches the actual shape. For example, even if the predicted depth of a wall has an overall offset, this term will be very small as long as the orientation of the wall surface is predicted correctly (the gradient directions are consistent).
[0057] Third, the system calculates the depth weighting factor for that pixel. The weight is calculated from the actual depth value of the pixel in the sparse LiDAR depth map using an exponential decay function. It decreases as the distance increases, thus giving higher importance to the errors of nearer points. This can be expressed by the following formula:
[0058] $$W(p) = \exp\left( -\lambda \, D_{lidar}(p) \right)$$
[0059] where $W(p)$ is the depth weight at pixel $p$; $\exp(\cdot)$ is the natural exponential function; $\lambda$ is a small positive attenuation coefficient; and $D_{lidar}(p)$ is the true depth value of the sparse LiDAR depth map at $p$. This depth weighting factor simulates the high level of attention humans pay to nearby objects in visual perception and driving decisions. By using an exponential decay function, it assigns higher weight to the errors of nearby targets, thus directly linking the measurement result to driving safety risk.
[0060] Fourth, the system performs a weighted sum of the absolute depth error term and the local gradient consistency error term at that pixel, where the weights are determined by a preset hyperparameter. The sum is then multiplied by the depth weighting factor to obtain the contribution of that pixel to the overall inconsistency score. Finally, the system averages the contributions of all valid pixels to obtain the final cross-modal inconsistency score, expressed by the following formula:
[0061] $$S = \frac{1}{|\Omega|} \sum_{p \in \Omega} W(p) \left[ \alpha \, E_{abs}(p) + (1 - \alpha) \, E_{grad}(p) \right]$$
[0062] where $S$ is the final cross-modal inconsistency score; $\Omega$ is the set of valid pixels in the sparse LiDAR depth map; $|\Omega|$ is the number of elements in the set; $p$ is a specific pixel in the set; $W(p)$ is the depth weighting factor at $p$; $E_{abs}(p)$ and $E_{grad}(p)$ are the absolute depth error term and the local gradient consistency error term at $p$; and $\alpha$ is a hyperparameter between 0 and 1 that balances the weights of the absolute error and the gradient error.
[0063] Through the aforementioned improved mechanism, an inconsistency score is generated that is more reliable than the traditional mean absolute error and better reflects the complexity of the physical world. By employing depth weighting, the mechanism concentrates the metric on the near-field space, which is crucial for autonomous driving safety, and so improves sensitivity to perception errors on near-field obstacles. At the same time, by introducing local gradient consistency comparison, the mechanism goes beyond simple point-by-point numerical comparison and takes local geometry into account, so that the fidelity of the prediction in shape and contour can be assessed. In this way, a high-quality adjudication signal is provided to the multi-sensor fusion system, enabling it to judge the actual data quality of each sensor more accurately in complex and variable mountainous environments, and thus to make more robust and safer fusion decisions when a sensor partially or completely fails.
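The four-step calculation above can be sketched in NumPy as follows. This is an illustrative sketch only: the function name, the default values of `alpha`, `lam` and `eps`, and the assignment of `alpha` to the depth term and `1 - alpha` to the gradient term are assumptions, since the text states only that a hyperparameter between 0 and 1 balances the two terms.

```python
import numpy as np

def inconsistency_score(d_pred, d_lidar, g_pred, g_lidar, valid,
                        alpha=0.5, lam=0.05, eps=1e-8):
    """Depth-weighted composite inconsistency score over valid LiDAR pixels.

    d_pred, d_lidar: (H, W) depth maps.
    g_pred, g_lidar: (H, W, 2) gradient maps (horizontal, vertical components).
    valid: (H, W) boolean mask of pixels with a valid LiDAR depth.
    """
    if not valid.any():
        raise ValueError("no valid LiDAR pixels to compare")
    # Absolute depth error term.
    e_abs = np.abs(d_pred - d_lidar)
    # Gradient consistency error term: 1 minus cosine similarity.
    dot = np.sum(g_pred * g_lidar, axis=-1)
    norms = np.linalg.norm(g_pred, axis=-1) * np.linalg.norm(g_lidar, axis=-1)
    e_grad = 1.0 - dot / (norms + eps)
    # Depth weighting factor: exponential decay with LiDAR depth.
    w = np.exp(-lam * d_lidar)
    # Per-pixel contribution, averaged over the valid set.
    contrib = w * (alpha * e_abs + (1.0 - alpha) * e_grad)
    return float(contrib[valid].mean())
```

With these defaults, a 1 m depth error at 5 m contributes roughly ten times more than the same error at 50 m. One property of the cosine form is worth noting: where both gradient vectors are exactly zero, the gradient term evaluates to 1 rather than 0, so an implementation may want to treat perfectly flat regions separately.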
[0064] Accordingly, in step S240, the confidence scores of the first and second sensors are calculated based on the cross-modal inconsistency score, the camera modal quality feature vector, and the lidar modal quality feature vector. It should be understood that since the sensor internal quality feature vector and the cross-modal inconsistency score generated in the preceding steps are independent and heterogeneous evaluation indicators, they cannot be directly used as the final basis for fusion weights. The system needs a comprehensive decision-making mechanism to determine which sensor should be blamed when inconsistency occurs. Therefore, in the technical solution of this application, the confidence scores of the first and second sensors are further calculated based on the cross-modal inconsistency score, the camera modal quality feature vector, and the lidar modal quality feature vector. This intelligently attributes and integrates the dispersed, multi-dimensional quality evaluation evidence, generating a final, single, and normalized reliability assessment score for each sensor. This provides a clear, explicit, and directly usable quantitative guidance signal for the subsequent adaptive fusion module, enabling the system to accurately adjust the information contribution of each sensor based on this score.
[0065] Specifically, in this embodiment, calculating the first sensor confidence score and the second sensor confidence score based on the cross-modal inconsistency score, the camera modal quality feature vector, and the LiDAR modal quality feature vector includes: concatenating the cross-modal inconsistency score with the camera modal quality feature vector and inputting the result into a score regression head to obtain the first sensor confidence score; and concatenating the cross-modal inconsistency score with the LiDAR modal quality feature vector and inputting the result into a score regression head to obtain the second sensor confidence score. More specifically, the calculation of these confidence scores is performed on the vehicle-mounted computing unit. First, for the camera modality, the system concatenates the scalar cross-modal inconsistency score calculated in the preceding steps with the camera modal quality feature vector to form a combined feature vector containing both internal camera quality information and external consistency information. The combined feature vector is then input into a score regression head dedicated to camera confidence assessment. This regression head consists of several fully connected layers and a sigmoid activation function; it performs a nonlinear mapping on the combined features and outputs a scalar value between 0 and 1, which is the first sensor confidence score. Second, in parallel, the system concatenates the same cross-modal inconsistency score with the LiDAR modal quality feature vector to generate a combined feature vector containing both the internal quality information and the external consistency information of the LiDAR. This vector is then input into another, independent score regression head, which processes the input in the same way and outputs the second sensor confidence score, characterizing the reliability of the LiDAR data.
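A score regression head of the kind described above might look like the following NumPy sketch. The layer count (one hidden layer), the ReLU activation, and all names and shapes are assumptions made for illustration; the text specifies only several fully connected layers followed by a sigmoid. In practice the weights would be learned, and the camera and LiDAR heads would each hold their own parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def confidence_head(score, quality_vec, w1, b1, w2, b2):
    """Map (inconsistency score, quality vector) to a confidence in (0, 1).

    w1: (hidden, 1 + len(quality_vec)), b1: (hidden,)
    w2: (hidden,), b2: scalar
    """
    # Concatenate the scalar score with the modal quality feature vector.
    x = np.concatenate([[score], quality_vec])
    # Hidden fully connected layer with ReLU (assumed activation).
    h = np.maximum(0.0, w1 @ x + b1)
    # Output layer followed by sigmoid, giving a scalar between 0 and 1.
    return float(sigmoid(w2 @ h + b2))
```

Calling the function twice, once with the camera quality vector and the camera head's weights and once with the LiDAR quality vector and the LiDAR head's weights, yields the first and second sensor confidence scores from the same inconsistency score.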
[0066] Specifically, in step S300, multimodal feature extraction is performed on the original image data and the original LiDAR data to obtain image feature maps and point cloud feature maps. It should be understood that since the original image pixel data and LiDAR 3D coordinate point data are high-dimensional, sparse, and heterogeneous low-level information, they lack the semantic and structural abstraction necessary for performing advanced perception tasks, making meaningful information fusion and analysis impossible directly. Therefore, in the technical solution of this application, multimodal feature extraction is further performed on the original image data and the original LiDAR data to obtain image feature maps and point cloud feature maps. This encodes the original sensing signals of different modalities in parallel into more compact, higher-level feature representations containing rich contextual information. This provides a standardized input that can be aligned and interacted with at the same abstraction level for the subsequent adaptive feature fusion module, thus laying a solid foundation for effective information complementarity and integration within the feature space.
[0067] More specifically, in a concrete example of this application, the multimodal feature extraction process is executed in parallel on the vehicle-mounted computing unit. First, for the raw image data, the system inputs it into a convolutional neural network-based image encoder, such as a ResNet network pre-trained on a large image dataset. Through a series of convolution, pooling, and nonlinear activation operations, this network extracts hierarchical information from the image layer by layer, ranging from low-level features such as edges and textures to high-level semantic features such as object parts, and outputs a multi-channel, spatially downsampled two-dimensional image feature map. Second, for the raw LiDAR data, the system employs a pillar-based three-dimensional feature extraction network, such as the PointPillars architecture. This architecture first divides the three-dimensional point cloud into a grid of vertical pillars on the horizontal plane, then uses a simplified PointNet network to encode the features of the points within each pillar, generating compact pillar features. All pillar features are then scattered back to a two-dimensional pseudo-image plane, forming a bird's-eye-view feature representation. This representation is processed by a two-dimensional convolutional neural network backbone to aggregate spatial context information, generating a multi-channel two-dimensional point cloud feature map. The image feature map and the point cloud feature map generated in parallel are then passed to the subsequent adaptive fusion process.
[0068] Specifically, in step S400, based on the first sensor confidence score and the second sensor confidence score, adaptive feature fusion based on confidence modulation is performed on the image feature map and the point cloud feature map to obtain a multimodal fusion feature map of the mountainous environment. It should be understood that traditional feature fusion methods, such as simple feature concatenation or addition, treat all input features equally and lack a dynamic arbitration mechanism. As a result, when the data quality of a particular sensor deteriorates, its contaminated and misleading features indiscriminately pollute the final fusion result, directly causing the perception system to make incorrect judgments at critical moments. Therefore, in the technical solution of this application, based on the first sensor confidence score and the second sensor confidence score, adaptive feature fusion based on confidence modulation is performed on the image feature map and the point cloud feature map to obtain the multimodal fusion feature map of the mountainous environment. This establishes a real-time information filtering and weighting mechanism that dynamically adjusts the contribution of each modal feature in the fusion process using the reliability scores calculated in the preceding steps. When a sensor partially or completely fails, the system can therefore actively suppress the inflow of unreliable information and prioritize the features of the sensor that currently has higher quality, thereby generating a more robust and reliable final environmental perception representation across varying conditions.
[0069] Figure 5 is a flowchart, according to embodiments of this application, of performing adaptive feature fusion based on confidence modulation on the image feature map and the point cloud feature map, based on the first sensor confidence score and the second sensor confidence score, to obtain the multimodal fusion feature map of the mountainous environment. As shown in Figure 5, step S400 includes: S410, performing a linear transformation on the image feature map to obtain an image feature query vector; S420, performing a linear transformation on the point cloud feature map to obtain a point cloud feature key vector and a point cloud feature value vector; S430, calculating the image-point cloud cross-modal attention matrix between the image feature query vector and the point cloud feature key vector based on the first sensor confidence score; S440, multiplying the image-point cloud cross-modal attention matrix by the point cloud feature value vector to obtain an image-guided point cloud enhancement feature vector; S450, performing a linear transformation on the image feature map to obtain an image feature key vector and an image feature value vector; S460, performing a linear transformation on the point cloud feature map to obtain a point cloud feature query vector; S470, calculating the point cloud-image cross-modal attention matrix between the point cloud feature query vector and the image feature key vector based on the second sensor confidence score; S480, multiplying the point cloud-image cross-modal attention matrix by the image feature value vector to obtain a point cloud-guided image enhancement feature vector; S490, performing feature fusion on the point cloud-guided image enhancement feature vector and the image-guided point cloud enhancement feature vector to obtain a mountain environment multimodal fusion feature vector; and S401, performing feature shape reshaping on the mountain environment multimodal fusion feature vector to obtain the mountain environment multimodal fusion feature map.
[0070] Accordingly, in steps S410, S420, S430, and S440, a linear transformation is performed on the image feature map to obtain the image feature query vector, and a linear transformation is performed on the point cloud feature map to obtain the point cloud feature key vector and the point cloud feature value vector. Based on the first sensor confidence score, the image-point cloud cross-modal attention matrix between the image feature query vector and the point cloud feature key vector is calculated. The image-point cloud cross-modal attention matrix is multiplied by the point cloud feature value vector to obtain the image-guided point cloud enhancement feature vector. It should be understood that simple feature weighting or splicing fusion methods can only achieve a coarse mixing of information between modalities, lacking a mechanism for fine-grained, selective information interaction between different modal features. This results in the inability to fully utilize the rich semantic information such as texture and color of the image to enhance and disambiguate the sparse geometric structure information of the LiDAR. Therefore, in the technical solution of this application, an image feature query vector, a point cloud feature key vector, and a point cloud feature value vector are further constructed. These features are then fused based on the confidence score of the first sensor to obtain an image-guided point cloud enhancement feature vector. This establishes a controllable cross-modal information query and enhancement channel dominated by image features. In this way, when the image data is reliable, the system can use image features as probes to actively and selectively search for and aggregate the most relevant geometric information in the point cloud feature space. This accurately injects the semantic advantages of the image into the structural representation of the point cloud, generating a semantically richer and structurally clear enhanced feature. 
Simultaneously, by controlling the confidence score, it ensures that this enhancement process does not introduce noise when the image quality is low.
[0071] Specifically, in a concrete example of this application, the image-guided point cloud feature enhancement process is executed on an onboard computing unit. First, the system inputs the image feature map generated in the preceding steps into a linear transformation layer to generate the image feature query vector. In parallel, it inputs the point cloud feature map into two independent linear transformation layers to generate the point cloud feature key vector and the point cloud feature value vector, respectively. Second, the system calculates the raw attention score between the image feature query vector and the point cloud feature key vector by multiplying the query vector matrix by the transpose of the key vector matrix. Third, the system multiplies every element of the raw attention score matrix by the scalar first sensor confidence score, thereby dynamically modulating the attention score according to the real-time reliability of the camera. Fourth, the system normalizes the confidence-modulated attention score matrix along a specific dimension using the Softmax function to obtain the final image-point cloud cross-modal attention matrix, which represents the attention weights of the image features over each point cloud feature. Fifth, the system multiplies the image-point cloud cross-modal attention matrix by the point cloud feature value vector matrix. The physical meaning of this operation is a weighted summation of the point cloud feature value vectors according to the attention weights; the result is the image-guided point cloud enhancement feature vector. This process is represented by the following formula:
[0072] $$F_{I \to P} = \mathrm{Softmax}\left( c_1 \cdot \frac{Q_I K_P^{T}}{\sqrt{d}} \right) V_P$$
[0073] where $F_{I \to P}$ is the image-guided point cloud enhancement feature vector; $\mathrm{Softmax}(\cdot)$ is the Softmax function; $Q_I$ is the image feature query vector; $K_P$ is the point cloud feature key vector; $V_P$ is the point cloud feature value vector; $d$ is the scale (the dimension of the key vectors); and $c_1$ is the first sensor confidence score.
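The five steps above can be sketched in NumPy as follows. The function names, the token-matrix layout (one row per spatial location), the square-root scaling by the key dimension, and the choice of normalizing over the key axis are assumptions; the text says only that Softmax is applied along a specific dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_modulated_attention(feat_q, feat_kv, w_q, w_k, w_v, confidence):
    """Cross-modal attention whose raw scores are scaled by a confidence.

    feat_q: (Nq, C) features of the querying modality.
    feat_kv: (Nk, C) features of the other modality.
    w_q, w_k, w_v: (C, d) linear projection matrices.
    confidence: scalar in [0, 1] for the querying sensor.
    """
    q = feat_q @ w_q   # query vectors, (Nq, d)
    k = feat_kv @ w_k  # key vectors, (Nk, d)
    v = feat_kv @ w_v  # value vectors, (Nk, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])          # raw attention scores
    attn = softmax(confidence * scores, axis=-1)     # modulate, then normalize
    return attn @ v, attn
```

Passing image features as `feat_q`, point cloud features as `feat_kv`, and the first sensor confidence score gives the image-guided branch. Under this formulation a confidence of 0 flattens the attention to a uniform distribution, so the output becomes the plain average of the value vectors rather than zero; a low score removes the selectivity of the query instead of removing the output.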
[0074] Accordingly, in steps S450, S460, S470, and S480, a linear transformation is performed on the image feature map to obtain the image feature key vector and the image feature value vector. A linear transformation is also performed on the point cloud feature map to obtain the point cloud feature query vector. Based on the second sensor confidence score, a point cloud-image cross-modal attention matrix is calculated between the point cloud feature query vector and the image feature key vector. This point cloud-image cross-modal attention matrix is then multiplied by the image feature value vector to obtain the point cloud-guided image enhancement feature vector. It should be understood that because only image-driven unidirectional information enhancement is performed, the precise geometric structure information of the LiDAR data cannot be used to reverse-calibrate and enhance image features that may have geometric ambiguities. For example, a realistic billboard in an image may semantically be identified as a vehicle, but it lacks the three-dimensional structural support provided by the LiDAR. Therefore, in the technical solution of this application, a point cloud feature query vector, an image feature key vector, and an image feature value vector are further constructed. These features are then fused based on the confidence score of the second sensor to obtain a point cloud-guided image enhancement feature vector. This constructs a cross-modal information query and enhancement channel dominated by point cloud features, complementary to the preceding steps. In this way, when the LiDAR data is reliable, the system can utilize its precise structural features as anchor points to actively focus on and extract corresponding semantic information with true three-dimensional support in the image feature space. 
This injects the geometric determinism of the point cloud into the semantic representation of the image, generating a geometrically more reliable and semantically more focused enhancement feature. Simultaneously, by adjusting the confidence score, this process ensures that it will not mislead image features when the LiDAR data quality is poor.
[0075] Specifically, in a specific example of this application, the point cloud-guided image feature enhancement process is executed on the vehicle-mounted computing unit in parallel with the aforementioned image-guided point cloud enhancement process. First, the system inputs the point cloud feature map generated in the preceding steps into a linear transformation layer to generate the point cloud feature query vector. In parallel, it inputs the image feature map into two independent linear transformation layers to generate the image feature key vector and the image feature value vector, respectively. Second, the system calculates the raw attention score between the point cloud feature query vector and the image feature key vector by multiplying the query vector matrix by the transpose of the key vector matrix. Third, the system multiplies every element of the raw attention score matrix by the scalar second sensor confidence score, thereby dynamically modulating the attention score according to the real-time reliability of the LiDAR. Fourth, the system normalizes the confidence-modulated attention score matrix along a specific dimension using the Softmax function to obtain the final point cloud-image cross-modal attention matrix, which represents the attention weights of the point cloud features over each image feature. Fifth, the system multiplies the point cloud-image cross-modal attention matrix by the image feature value vector matrix. The physical meaning of this operation is a weighted summation of the image feature value vectors according to the attention weights; the result is the point cloud-guided image enhancement feature vector.
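The text gives no formula for this direction. By symmetry with the image-guided branch, the five steps above could be written as follows; the symbols and the square-root scaling are notation assumed here for illustration and are not taken from the original:

$$F_{P \to I} = \mathrm{Softmax}\left( c_2 \cdot \frac{Q_P K_I^{T}}{\sqrt{d}} \right) V_I$$

where $F_{P \to I}$ denotes the point cloud-guided image enhancement feature vector, $Q_P$ the point cloud feature query vector, $K_I$ and $V_I$ the image feature key vector and image feature value vector, $d$ the scale (the dimension of the key vectors), and $c_2$ the second sensor confidence score.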
[0076] Accordingly, in steps S490 and S401, feature fusion is performed on the point cloud-guided image enhancement feature vector and the image-guided point cloud enhancement feature vector to obtain the mountain environment multimodal fusion feature vector, and feature shape reshaping is performed on the mountain environment multimodal fusion feature vector to obtain the mountain environment multimodal fusion feature map. It should be understood that the image-guided point cloud enhancement feature vector and the point cloud-guided image enhancement feature vector generated in the preceding steps are two independent feature streams optimized from different modal perspectives. Although each contains the results of cross-modal interaction, they have not yet been integrated into a unified representation that reflects the combined benefit of the two-way information exchange, and their vectorized data format cannot be used directly by downstream task modules that require two-dimensional spatial input. Therefore, in the technical solution of this application, feature fusion is further performed on the point cloud-guided image enhancement feature vector and the image-guided point cloud enhancement feature vector to obtain the mountain environment multimodal fusion feature vector, which is then reshaped to obtain the mountain environment multimodal fusion feature map. This process combines the complementary information from the bidirectional enhancement into a single, holistic feature representation and restores the spatial structure that subsequent processing requires. The result is a final, unified multimodal feature map that combines the complementary advantages of the two modalities in content and is compatible in form with standard convolutional perception networks, providing a suitable input for high-precision, highly robust mountain environment perception tasks.
[0077] Specifically, in a concrete example of this application, the final fused feature map is generated on the vehicle-mounted computing unit. First, the system receives in parallel the point cloud-guided image enhancement feature vector and the image-guided point cloud enhancement feature vector generated in the preceding steps. Second, the system performs a feature fusion operation on these two feature vectors; specifically, it concatenates them along the feature channel dimension to form a higher-dimensional mountain environment multimodal fusion feature vector that retains all the results of the bidirectional attention interaction. Third, the system performs a feature shape reshaping operation on the mountain environment multimodal fusion feature vector. Specifically, based on the preset height and width of the bird's-eye view spatial grid used in the feature extraction stage, the system rearranges this one-dimensional fused feature vector into a three-dimensional tensor whose dimensions are the grid height, the grid width, and the number of channels after concatenation. This three-dimensional tensor is the final mountain environment multimodal fusion feature map; it retains rich spatial relationships and is passed directly to the decoder module, which performs subsequent tasks such as object detection or semantic segmentation for final scene parsing.
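As a non-limiting illustration, the concatenation and reshaping steps above can be sketched as follows. The function name `fuse_and_reshape` is hypothetical, and the sketch assumes that both enhancement streams hold one feature row per bird's-eye view grid cell in row-major order; this application does not state how the two streams are aligned.

```python
import numpy as np


def fuse_and_reshape(img_enh, pc_enh, height, width):
    """Concatenate two enhancement streams and restore the BEV grid layout.

    img_enh: (H*W, C1) point cloud-guided image enhancement features
    pc_enh:  (H*W, C2) image-guided point cloud enhancement features
    Assumption: row i*width + j of each stream belongs to grid cell (i, j).
    """
    cells = height * width
    if img_enh.shape[0] != cells or pc_enh.shape[0] != cells:
        raise ValueError("each stream needs exactly one feature row per BEV cell")
    # Feature fusion: concatenate along the channel dimension -> (H*W, C1+C2).
    fused = np.concatenate([img_enh, pc_enh], axis=-1)
    # Shape reshaping: rearrange into a (H, W, C1+C2) feature map.
    return fused.reshape(height, width, fused.shape[-1])
```

Concatenation keeps both streams intact, so the downstream task head is free to learn how to weigh them; the channel count of the resulting map is simply the sum of the two input channel counts.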
[0078] Specifically, in step S500, the multimodal fusion feature map of the mountain environment is input into the environmental perception task head to obtain the environmental perception result. It should be understood that the multimodal fusion feature map of the mountain environment generated in the preceding steps is an intermediate feature representation containing rich semantic and structural information. It is not a final result that can be used directly for vehicle decision-making and planning, so it must be further interpreted and decoded. Therefore, in the technical solution of this application, the multimodal fusion feature map of the mountain environment is input into the environmental perception task head to obtain the environmental perception result, thereby mapping this high-dimensional, abstract fusion feature to a specific, physically meaningful perception task space. In this way, the system's understanding of the environment is transformed into a structured, quantifiable output, such as the 3D bounding boxes, positions, and categories of obstacles in the scene (e.g., vehicles, pedestrians, falling rocks) and pixel-level segmentation maps of drivable areas, providing a direct and reliable basis for decision-making by downstream path planning and control modules.
[0079] More specifically, in a concrete example of this application, the environmental perception result is generated on the vehicle-mounted computing unit, and the specific structure of the task head depends on the preset perception task. First, the system takes the multimodal fusion feature map of the mountain environment generated in the preceding steps as input. Second, if the preset task is 3D object detection, the feature map is fed into a dedicated detection head. This detection head consists of multiple parallel convolutional branches: one branch predicts whether a target exists at each spatial location, together with its confidence level; another branch regresses the 3D bounding box parameters of the target at that location, including center point coordinates, size, and orientation angle; and a third branch classifies the target. Third, if the preset task is semantic segmentation under a bird's-eye view, the feature map is fed into a segmentation head. This segmentation head typically consists of a series of upsampling or deconvolutional layers that restore the resolution of the feature map to a preset output size, followed by a convolutional layer and a Softmax activation function that predict a single semantic category label, such as road, vegetation, or obstacle, for each grid cell in the bird's-eye view. Ultimately, the bounding box lists or semantic segmentation maps output by the task head constitute the final environmental perception result required by this application.
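As a non-limiting illustration, the two task heads described above can be sketched in NumPy. Everything named here is an assumption made for the sketch: each convolutional branch is reduced to a single 1x1 convolution, the bounding box is encoded with seven parameters (three center coordinates, three sizes, one orientation angle), upsampling is nearest-neighbour, and the `params` dictionary layout is hypothetical.

```python
import numpy as np


def conv1x1(fmap, weight, bias):
    # A 1x1 convolution is a per-cell linear map over the channel axis.
    return fmap @ weight + bias


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)


def detection_head(fmap, params):
    """fmap: (H, W, C) fused feature map. Three parallel branches."""
    objectness = sigmoid(conv1x1(fmap, *params["obj"]))[..., 0]  # (H, W) target presence
    boxes = conv1x1(fmap, *params["box"])                        # (H, W, 7) box regression
    class_probs = softmax(conv1x1(fmap, *params["cls"]))         # (H, W, K) classification
    return objectness, boxes, class_probs


def segmentation_head(fmap, params, scale=2):
    """Upsample, apply a 1x1 convolution and Softmax, then pick one label per cell."""
    up = fmap.repeat(scale, axis=0).repeat(scale, axis=1)        # nearest-neighbour upsampling
    probs = softmax(conv1x1(up, *params["seg"]))                 # (H*scale, W*scale, S)
    labels = probs.argmax(axis=-1)                               # one category per grid cell
    return labels, probs
```

Each entry of `params` is a `(weight, bias)` pair with `weight` of shape `(C, out_channels)`. A practical head would stack several convolutional layers per branch and decode the box list with thresholding and non-maximum suppression, which this sketch omits.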
[0080] In summary, the mountain environment perception method based on multi-sensor fusion according to the embodiments of this application has been described. To address the perception failures that occur in existing technologies when sensor data quality changes abruptly, the method introduces a parallel real-time data quality assessment module as a pre-judgment step for feature fusion. This module not only independently analyzes the internal data features of each sensor to assess its inherent quality, but also introduces a cross-modal inconsistency measure: by comparing how different sensors interpret the geometric structure of the same scene, it determines whether the data conflict. Based on the dynamic confidence scores obtained from the assessment, a confidence-modulated adaptive feature fusion mechanism is then constructed. This mechanism uses confidence as the key modulation factor to dynamically adjust the contribution weight of each sensor's features during fusion. When the information provided by any sensor is unreliable, the mechanism suppresses its influence and prioritizes information sources with high credibility. This allows the system to output stable and reliable perception results under extreme conditions such as glare and dense fog, improving the safety and robustness of unmanned systems in mountainous environments.
[0081] Furthermore, a mountain environment perception system based on multi-sensor fusion is also provided.
[0082] Figure 6 is a block diagram of a mountain environment perception system based on multi-sensor fusion according to an embodiment of this application. As shown in Figure 6, the mountain environment perception system 100 based on multi-sensor fusion according to an embodiment of this application includes: a raw data acquisition module 110, used to acquire raw image data and raw LiDAR data of the target mountain environment; a real-time data quality assessment module 120, used to perform parallel real-time data quality assessment on the raw image data and raw LiDAR data to obtain a first sensor confidence score and a second sensor confidence score; a multimodal feature extraction module 130, used to extract multimodal features from the raw image data and raw LiDAR data to obtain an image feature map and a point cloud feature map; a mountain environment multimodal feature fusion module 140, used to perform adaptive feature fusion based on confidence modulation on the image feature map and the point cloud feature map based on the first sensor confidence score and the second sensor confidence score to obtain a mountain environment multimodal fusion feature map; and an environment perception module 150, used to input the mountain environment multimodal fusion feature map into the environment perception task head to obtain an environment perception result.
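As a non-limiting illustration, the data flow between modules 120 to 150 can be sketched as a thin Python wrapper around four callables. The class name, field names, and callable signatures are assumptions for this sketch; the output of the raw data acquisition module 110 is taken as the input of `perceive`, and the quality assessment is called sequentially here even though the application describes it as running in parallel.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class MountainPerceptionSystem:
    # Each field stands in for one module of system 100 (hypothetical interfaces).
    assess_quality: Callable[[Any, Any], Tuple[float, float]]   # module 120
    extract_features: Callable[[Any, Any], Tuple[Any, Any]]     # module 130
    fuse: Callable[[Any, Any, float, float], Any]               # module 140
    task_head: Callable[[Any], Any]                             # module 150

    def perceive(self, raw_image: Any, raw_lidar: Any) -> Any:
        # First and second sensor confidence scores.
        cam_conf, lidar_conf = self.assess_quality(raw_image, raw_lidar)
        # Image feature map and point cloud feature map.
        img_map, pc_map = self.extract_features(raw_image, raw_lidar)
        # Confidence-modulated fusion into a single feature map.
        fused_map = self.fuse(img_map, pc_map, cam_conf, lidar_conf)
        # Environment perception result.
        return self.task_head(fused_map)
```

Passing the modules in as callables keeps the wrapper independent of how each module is realized, which matches the statement that the system may be implemented as software modules, hardware modules, or both.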
[0083] As described above, the mountain environment perception system 100 based on multi-sensor fusion according to the embodiments of this application can be implemented in various wireless terminals, such as servers running a mountain environment perception algorithm based on multi-sensor fusion. In one possible implementation, the system 100 can be integrated into the wireless terminal as a software module and/or a hardware module. For example, the system 100 can be a software module in the operating system of the wireless terminal, or an application developed for the wireless terminal; of course, the system 100 can also be one of the many hardware modules of the wireless terminal.
[0084] The various embodiments of this disclosure have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or improvement of the technology in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.
Claims
1. A mountain environment perception method based on multi-sensor fusion, characterized in that it comprises: acquiring raw image data and raw LiDAR data of a target mountain environment; performing parallel real-time data quality assessment on the raw image data and the raw LiDAR data to obtain a first sensor confidence score and a second sensor confidence score, which includes: extracting a camera modal quality feature vector from the raw image data; extracting a LiDAR modal quality feature vector from the raw LiDAR data; performing a cross-modal prediction-based inconsistency measurement on the raw image data and the raw LiDAR data to obtain a cross-modal inconsistency score; and calculating the first sensor confidence score and the second sensor confidence score based on the cross-modal inconsistency score, the camera modal quality feature vector, and the LiDAR modal quality feature vector; performing multimodal feature extraction on the raw image data and the raw LiDAR data to obtain an image feature map and a point cloud feature map; performing confidence-modulated adaptive feature fusion on the image feature map and the point cloud feature map based on the first sensor confidence score and the second sensor confidence score to obtain a mountain environment multimodal fusion feature map; and inputting the mountain environment multimodal fusion feature map into an environmental perception task head to obtain an environmental perception result; wherein performing the cross-modal prediction-based inconsistency measurement on the raw image data and the raw LiDAR data to obtain the cross-modal inconsistency score includes: inputting the raw image data into a monocular depth estimation model to obtain a predicted depth map; projecting the raw LiDAR data onto the image coordinate system through a calibration matrix to obtain a sparse LiDAR depth map; performing local gradient calculation on the predicted depth map and the sparse LiDAR depth map to obtain a gradient map of the predicted depth map and a gradient map of the sparse LiDAR depth map; calculating an absolute depth error term between the predicted depth map and the sparse LiDAR depth map; calculating a local gradient consistency error term between the gradient map of the predicted depth map and the gradient map of the sparse LiDAR depth map; and performing error aggregation on the absolute depth error term and the local gradient consistency error term based on a depth weighting factor determined from the sparse LiDAR depth map to obtain the cross-modal inconsistency score.
2. The mountain environment perception method based on multi-sensor fusion according to claim 1, characterized in that extracting the camera modal quality feature vector from the raw image data includes: inputting the raw image data into a lightweight convolutional neural network model to obtain the camera modal quality feature vector; and extracting the LiDAR modal quality feature vector from the raw LiDAR data includes: inputting the raw LiDAR data into a point cloud quality feature extractor based on a sparse convolutional architecture to obtain the LiDAR modal quality feature vector.
3. The mountain environment perception method based on multi-sensor fusion according to claim 1, characterized in that calculating the first sensor confidence score and the second sensor confidence score based on the cross-modal inconsistency score, the camera modal quality feature vector, and the LiDAR modal quality feature vector includes: concatenating the cross-modal inconsistency score with the camera modal quality feature vector and inputting the result into a score regression head to obtain the first sensor confidence score; and concatenating the cross-modal inconsistency score with the LiDAR modal quality feature vector and inputting the result into the score regression head to obtain the second sensor confidence score.
4. The mountain environment perception method based on multi-sensor fusion according to claim 1, characterized in that performing confidence-modulated adaptive feature fusion on the image feature map and the point cloud feature map based on the first sensor confidence score and the second sensor confidence score to obtain the mountain environment multimodal fusion feature map includes: performing a linear transformation on the image feature map to obtain an image feature query vector; performing a linear transformation on the point cloud feature map to obtain a point cloud feature key vector and a point cloud feature value vector; calculating, based on the first sensor confidence score, an image-point cloud cross-modal attention matrix between the image feature query vector and the point cloud feature key vector; and multiplying the image-point cloud cross-modal attention matrix by the point cloud feature value vector to obtain an image-guided point cloud enhancement feature vector.
5. The mountain environment perception method based on multi-sensor fusion according to claim 4, characterized in that performing confidence-modulated adaptive feature fusion on the image feature map and the point cloud feature map based on the first sensor confidence score and the second sensor confidence score to obtain the mountain environment multimodal fusion feature map further includes: performing a linear transformation on the image feature map to obtain an image feature key vector and an image feature value vector; performing a linear transformation on the point cloud feature map to obtain a point cloud feature query vector; calculating, based on the second sensor confidence score, a point cloud-image cross-modal attention matrix between the point cloud feature query vector and the image feature key vector; and multiplying the point cloud-image cross-modal attention matrix by the image feature value vector to obtain a point cloud-guided image enhancement feature vector.
6. The mountain environment perception method based on multi-sensor fusion according to claim 5, characterized in that performing confidence-modulated adaptive feature fusion on the image feature map and the point cloud feature map based on the first sensor confidence score and the second sensor confidence score to obtain the mountain environment multimodal fusion feature map further includes: performing feature fusion on the point cloud-guided image enhancement feature vector and the image-guided point cloud enhancement feature vector to obtain a mountain environment multimodal fusion feature vector; and performing feature shape reshaping on the mountain environment multimodal fusion feature vector to obtain the mountain environment multimodal fusion feature map.
7. A mountain environment perception system based on multi-sensor fusion, characterized in that it comprises: a raw data acquisition module, used to acquire raw image data and raw LiDAR data of a target mountain environment; a real-time data quality assessment module, used to perform parallel real-time data quality assessment on the raw image data and the raw LiDAR data to obtain a first sensor confidence score and a second sensor confidence score, which includes: extracting a camera modal quality feature vector from the raw image data; extracting a LiDAR modal quality feature vector from the raw LiDAR data; performing a cross-modal prediction-based inconsistency measurement on the raw image data and the raw LiDAR data to obtain a cross-modal inconsistency score; and calculating the first sensor confidence score and the second sensor confidence score based on the cross-modal inconsistency score, the camera modal quality feature vector, and the LiDAR modal quality feature vector; a multimodal feature extraction module, used to perform multimodal feature extraction on the raw image data and the raw LiDAR data to obtain an image feature map and a point cloud feature map; a mountain environment multimodal feature fusion module, used to perform confidence-modulated adaptive feature fusion on the image feature map and the point cloud feature map based on the first sensor confidence score and the second sensor confidence score to obtain a mountain environment multimodal fusion feature map; and an environment perception module, used to input the mountain environment multimodal fusion feature map into an environment perception task head to obtain an environment perception result; wherein performing the cross-modal prediction-based inconsistency measurement on the raw image data and the raw LiDAR data to obtain the cross-modal inconsistency score includes: inputting the raw image data into a monocular depth estimation model to obtain a predicted depth map; projecting the raw LiDAR data onto the image coordinate system through a calibration matrix to obtain a sparse LiDAR depth map; performing local gradient calculation on the predicted depth map and the sparse LiDAR depth map to obtain a gradient map of the predicted depth map and a gradient map of the sparse LiDAR depth map; calculating an absolute depth error term between the predicted depth map and the sparse LiDAR depth map; calculating a local gradient consistency error term between the gradient map of the predicted depth map and the gradient map of the sparse LiDAR depth map; and performing error aggregation on the absolute depth error term and the local gradient consistency error term based on a depth weighting factor determined from the sparse LiDAR depth map to obtain the cross-modal inconsistency score.
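As a non-limiting illustration, the cross-modal inconsistency score recited in claims 1 and 7 can be sketched in NumPy. The claims do not fix the gradient operator, the form of the depth weighting factor, or how the two error terms are balanced, so the following are assumptions made only for this sketch: forward differences between adjacent pixels that both carry a LiDAR return, a weight of `1 / (1 + depth)` so that nearer returns count more, a balance factor `lam`, and zero-filled pixels where no LiDAR return exists. The predicted depth map and the projected sparse LiDAR depth map are taken as given inputs.

```python
import numpy as np


def cross_modal_inconsistency(pred_depth, lidar_depth, valid, lam=0.5):
    """Depth-weighted aggregate of depth and gradient disagreement.

    pred_depth:  (H, W) depth map from a monocular depth estimation model
    lidar_depth: (H, W) sparse LiDAR depth map, zero where there is no return
    valid:       (H, W) boolean mask of pixels that carry a LiDAR return
    lam:         assumed balance between the two error terms
    """
    # Assumed depth weighting factor, determined by the sparse LiDAR depth map.
    weight = np.where(valid, 1.0 / (1.0 + lidar_depth), 0.0)

    # Absolute depth error term.
    abs_err = np.abs(pred_depth - lidar_depth)

    # Local gradients as forward differences, used only where both neighbours are valid.
    gx_err = np.abs(np.diff(pred_depth, axis=1) - np.diff(lidar_depth, axis=1))
    gy_err = np.abs(np.diff(pred_depth, axis=0) - np.diff(lidar_depth, axis=0))
    gx_ok = valid[:, :-1] & valid[:, 1:]
    gy_ok = valid[:-1, :] & valid[1:, :]

    # Local gradient consistency error term, accumulated per pixel.
    grad_err = np.zeros_like(pred_depth, dtype=float)
    grad_err[:, :-1] += np.where(gx_ok, gx_err, 0.0)
    grad_err[:-1, :] += np.where(gy_ok, gy_err, 0.0)

    # Error aggregation: weighted mean over pixels with a LiDAR return.
    total = weight.sum()
    if total == 0:
        return 0.0  # no LiDAR returns, so nothing can be compared
    return float((weight * (abs_err + lam * grad_err)).sum() / total)
```

Under these assumptions, a constant depth offset between the two maps raises only the absolute depth term, while a disagreement about where depth edges lie also raises the gradient term.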