3D semantic occupancy prediction method and system based on multi-modal fusion
By using a world model architecture with sparse voxel encoding and dynamic-static decoupling, we have achieved efficient fusion and prediction of multimodal data in a unified three-dimensional space. This solves the shortcomings of single-modal perception in autonomous driving systems and improves environmental perception and prediction capabilities, especially in terms of robustness and safety in complex dynamic environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG UNIV OF TECH
- Filing Date
- 2026-02-13
- Publication Date
- 2026-06-19
AI Technical Summary
In existing autonomous driving systems, single-modal perception schemes suffer from high costs, susceptibility to severe weather, insufficient depth estimation, and a lack of predictive ability for dynamic environmental changes. Furthermore, multimodal fusion methods fail to achieve unified 3D spatial representation and efficient semantic modeling, resulting in insufficient robustness.
A world model architecture with sparse voxel encoding and dynamic-static decoupling is adopted. Multimodal data is mapped to a unified three-dimensional space through sparse hash indexing. Cross-modal attention mechanism is used for feature fusion. A dynamic-static separation dual-stream world model is designed for prediction. Finally, a 3D semantic occupancy map is generated through uncertainty decoding and arbitration.
It achieves high-precision and robust prediction of future scene evolution, improves the perception and prediction capabilities of autonomous driving systems in complex dynamic environments, reduces computational complexity, and enhances the robustness and safety of the system.
Figure CN122244826A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of autonomous driving and 3D scene modeling technology, and specifically relates to a 3D semantic occupancy prediction method and system based on multimodal fusion.
Background Technology
[0002] In autonomous driving systems, accurate and robust environmental perception is a fundamental prerequisite for reliable decision-making and planning. Currently, mainstream perception solutions mainly rely on single sensor modalities and build perception algorithm systems on this basis. Among them, LiDAR can directly acquire high-precision three-dimensional spatial information, constructing accurate spatial geometry for the vehicle, but it is costly and susceptible to adverse weather conditions. Another widely used solution is vision-based perception, which utilizes cameras to provide rich textures and semantics, but vision-based depth estimation capabilities are insufficient, and it is sensitive to changes in ambient lighting, easily leading to performance fluctuations.
[0003] To overcome the inherent limitations of single-modality approaches, existing technologies have proposed multimodal fusion methods, such as feature fusion of laser point clouds and camera images. However, existing fusion methods have some fundamental limitations: First, their fusion levels are mostly limited to the data or decision level. Due to the heterogeneity of sensor data, different coordinate systems, and inconsistent feature representations, they fail to achieve deep fusion under a unified three-dimensional spatial representation. Second, most current perception models belong to the static perception paradigm, which can only analyze sensor inputs at a single moment and output the scene state at that moment, failing to effectively capture and predict the spatiotemporal dynamic evolution of the scene. More importantly, existing systems generally lack high-dimensional world models capable of implicit or explicit reasoning about dynamic environmental changes, making it difficult to cope with the widespread uncertainties, occlusions, and interaction complexities in real driving environments, thus limiting the system's prediction and planning capabilities.
[0004] In recent years, 3D semantic occupancy prediction technology has shown significant potential as an emerging perception paradigm. This technology continuously predicts the occupancy state and semantic category (such as vehicle, pedestrian, road, etc.) of each voxel space using a 3D voxel grid as the basic unit, thus providing a unified representation foundation for perception, prediction, and planning tasks. However, most current semantic occupancy prediction methods are still mainly based on single-modal input, lacking robustness in complex scenarios such as sensor limitations or data sparseness. Furthermore, existing methods typically output static occupancy grids for the current frame or short time series, essentially lacking the ability to proactively and coherently infer the scene state over future multi-frame time series.
[0005] Therefore, to meet the practical needs of advanced autonomous driving, there is an urgent need for a novel 3D semantic occupancy prediction method and system. Such a system should efficiently integrate multimodal perception information from LiDAR and multi-view vision to perform accurate and robust semantic modeling within a unified 3D occupancy space. It should also leverage a world model to predict the future state of the environment, thereby constructing a more global, temporally aware, and reasoning-capable perception framework that better serves the integrated perception, prediction, and planning requirements of autonomous driving.
Summary of the Invention
[0006] In view of the above, the purpose of this invention is to provide a 3D semantic occupancy prediction method and system based on multimodal fusion. Through a world model architecture of sparse voxel encoding and dynamic-static decoupling, it efficiently fuses LiDAR and multi-view vision data to achieve high-precision and robust prediction of the future scene evolution state and outputs a 3D semantic occupancy map with spatial consistency and semantic accuracy. This improves the perception and prediction capabilities of autonomous driving systems in complex dynamic environments and is applicable to scenarios such as autonomous driving, environmental perception, and 3D reconstruction.
[0007] To achieve the above-mentioned objectives, the present invention provides the following technical solutions. In a first aspect, the present invention provides a 3D semantic occupancy prediction method based on multimodal fusion, comprising the following steps:
Camera images and LiDAR point clouds are mapped to a unified 3D space and preprocessed with sparse voxel encoding based on a sparse hash index to obtain sparse multimodal data;
A cross-modal attention mechanism is used to extract and fuse features from the sparse multimodal data, constructing unified multimodal sparse voxel features;
The multimodal sparse voxel features are input into a dynamic-static separation dual-flow world model, in which the static flow uses the vehicle motion transformation matrix to perform deterministic rigid body coordinate transformation prediction, the dynamic flow uses a neural network to perform probabilistic motion trajectory evolution prediction, and the two prediction results are recombined in a unified coordinate system to generate predicted sparse features for a future time;
Uncertainty decoding and arbitration are performed based on the predicted sparse features to generate the final 3D semantic occupancy map.
[0008] Preferably, the step of mapping camera images and LiDAR point clouds to a unified three-dimensional space and performing sparse voxel encoding preprocessing based on a sparse hash index to obtain sparse multimodal data includes: Multi-scale 2D features are extracted from the camera images, and the discrete depth probability distribution of each feature point is predicted. The 2D features are then lifted into three-dimensional space by combining the camera parameters to generate visual frustum features. Motion compensation and denoising are performed on the LiDAR point cloud. Each point is assigned to the corresponding voxel grid cell according to the preset resolution, and its relative position offset is calculated to expand the point features. A sparse hash index is constructed to store the hash key values of non-empty voxel coordinates, and a neural network extracts and aggregates voxel geometric features from the multiple points within each non-empty voxel. The visual frustum features and the voxel geometric features are aligned and fused in the same three-dimensional space defined by the sparse hash index to generate the sparse multimodal data.
[0009] Preferably, during alignment and fusion, the voxel geometric features of the LiDAR point cloud and the visual frustum features of the camera image are combined, based on the coordinate alignment results, either by channel concatenation or by dynamically adding a new voxel, to obtain fused feature vectors. The resulting sparse feature tensor, which contains a list of coordinate indices and the corresponding fused feature vectors, serves as the sparse multimodal data.
[0010] Preferably, the step of using a cross-modal attention mechanism to extract and fuse features from the sparse multimodal data to construct unified multimodal sparse voxel features includes: The voxel geometric features in the sparse multimodal data are used as the query vectors, and the visual frustum features in the sparse multimodal data are flattened and used as the key vectors and value vectors; Multi-head cross-attention is computed from the query, key, and value vectors to aggregate visual semantic information into the geometric features; Learnable gating coefficients are introduced to adaptively weight and fuse the aggregated features with the voxel geometric features in the original sparse multimodal data, generating the multimodal sparse voxel features.
[0011] Preferably, the static flow utilizes the vehicle motion transformation matrix to perform deterministic rigid body coordinate transformation prediction, including: Based on the semantic classification results of the multimodal sparse voxel features at the current moment, voxel features belonging to the static background are separated. Obtain the motion transformation matrix of the vehicle between the current moment and the next moment; Rigid body transformation is performed on the coordinate index of the static background voxels using the vehicle motion transformation matrix, and the corresponding static background voxel features are mapped to the transformed coordinate positions through interpolation.
[0012] Preferably, the dynamic flow utilizes a neural network to predict the probabilistic evolution of its motion trajectory, including: Based on the semantic classification results of the multimodal sparse voxel features at the current moment, voxel features belonging to the dynamic foreground are separated. The voxel features of the dynamic foreground of the historical frame are input into the temporal neural network to predict the displacement vector and feature evolution residual of each dynamic voxel at the next time step. The coordinates of the dynamic voxels are updated based on the displacement vector, and the voxel features of the corresponding dynamic foreground are updated using the feature evolution residual.
[0013] Preferably, the step of performing uncertainty decoding and arbitration based on the predicted sparse features to generate the final 3D semantic occupancy map includes: Based on the predicted sparse features, a first, geometry-guided evidence vector and a second, semantics-guided evidence vector are generated by parallel decoding branches, and the epistemic uncertainty corresponding to each evidence vector is calculated. For each voxel, the geometric occupancy state predictions and semantic category predictions obtained from the first and second evidence vectors are compared. When the two evidence vectors disagree on whether the voxel is occupied, occupancy state arbitration is triggered, and the occupancy state given by the vector with the lower epistemic uncertainty is adopted. When both determine that the voxel is occupied but the semantic categories differ, semantic category arbitration is triggered, and the semantic category given by the vector with the higher total amount of semantic evidence is adopted. Based on the evidence vectors after arbitration, the final semantic category and occupancy state of each voxel are determined, and a 3D semantic occupancy map with uncertainty information is generated.
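For illustration only, the two arbitration rules above can be sketched in Python with NumPy. Several details are assumptions that the text does not specify: each branch is taken to output a non-negative evidence vector over K+1 classes with column 0 meaning "free space", the uncertainty is computed with the common evidential (Dirichlet) formula u = (K+1) / sum(evidence + 1), and when the occupancy conflict is resolved in favor of "occupied" the class is taken from the winning branch. The function name `arbitrate` is hypothetical.

```python
import numpy as np

def arbitrate(e_geo, e_sem):
    """Resolve per-voxel conflicts between two evidence branches.

    e_geo, e_sem: (M, K+1) non-negative evidence arrays; column 0 is assumed
    to represent free space, columns 1..K the semantic classes.
    Returns (label, uncertainty); label 0 means unoccupied.
    """
    n_cls = e_geo.shape[1]
    # Assumed evidential uncertainty: number of classes / Dirichlet strength.
    u_geo = n_cls / (e_geo + 1.0).sum(axis=1)
    u_sem = n_cls / (e_sem + 1.0).sum(axis=1)

    occ_geo = e_geo.argmax(axis=1) != 0
    occ_sem = e_sem.argmax(axis=1) != 0
    cls_geo = e_geo[:, 1:].argmax(axis=1) + 1
    cls_sem = e_sem[:, 1:].argmax(axis=1) + 1

    geo_more_certain = u_geo <= u_sem
    geo_more_evidence = e_geo[:, 1:].sum(axis=1) >= e_sem[:, 1:].sum(axis=1)

    # Rule 1: occupancy conflict -> trust the branch with lower uncertainty.
    occ_conflict = occ_geo != occ_sem
    occupied = np.where(occ_conflict,
                        np.where(geo_more_certain, occ_geo, occ_sem),
                        occ_geo)

    # Rule 2: no occupancy conflict -> class from the branch with more
    # semantic evidence (identical classes make the choice irrelevant).
    use_geo = np.where(occ_conflict, geo_more_certain, geo_more_evidence)
    label = np.where(occupied, np.where(use_geo, cls_geo, cls_sem), 0)
    uncertainty = np.where(use_geo, u_geo, u_sem)
    return label, uncertainty
```

The returned uncertainty is that of the branch whose decision was adopted, which is one plausible way to attach uncertainty information to the output map.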
[0014] In a second aspect, embodiments of the present invention also provide a 3D semantic occupancy prediction system based on multimodal fusion, which implements the above-mentioned 3D semantic occupancy prediction method based on multimodal fusion and includes: a multimodal sparse coding module, a cross-modal feature fusion module, a dual-stream world model prediction module, and an uncertainty decoding and arbitration module;
The multimodal sparse coding module is used to map camera images and LiDAR point clouds to a unified three-dimensional space and perform sparse voxel encoding preprocessing based on a sparse hash index to obtain sparse multimodal data;
The cross-modal feature fusion module is used to extract and fuse features from the sparse multimodal data using a cross-modal attention mechanism and to construct unified multimodal sparse voxel features;
The dual-stream world model prediction module is used to input the multimodal sparse voxel features into a dynamic-static separation dual-stream world model, in which the static stream uses the vehicle motion transformation matrix to perform deterministic rigid body coordinate transformation prediction, the dynamic stream uses a neural network to perform probabilistic motion trajectory evolution prediction, and the two prediction results are recombined in a unified coordinate system to generate predicted sparse features for a future time;
The uncertainty decoding and arbitration module is used to perform uncertainty decoding and arbitration based on the predicted sparse features to generate the final 3D semantic occupancy map.
[0015] In a third aspect, embodiments of the present invention also provide an electronic device, including a memory and one or more processors, wherein the memory is used to store a computer program, and the processor is used to implement the above-described 3D semantic occupancy prediction method based on multimodal fusion when executing the computer program.
[0016] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which, when executed by a computer, implements the above-described 3D semantic occupancy prediction method based on multimodal fusion.
[0017] Compared with the prior art, the beneficial effects of the present invention include at least the following: (1) This invention introduces a sparse hash index mechanism at the data input end to perform sparse voxel encoding preprocessing. Multimodal data is mapped to a unified three-dimensional space and only non-empty voxels are computed, which yields an efficient representation of high-resolution three-dimensional spatial features, reduces computational complexity from scaling with the full grid volume to scaling with the number of voxels actually occupied in the scene, and substantially alleviates the problems of excessive memory usage and insufficient real-time inference in large-scale scenes.
[0018] (2) The present invention designs a dynamic-static separation dual-flow world model prediction architecture, which explicitly separates the scene into a static background flow and a dynamic foreground flow. It uses deterministic rigid body transformation to process the static background flow and probabilistic neural network to process the dynamic foreground flow. While maintaining the advantage of independent modeling of each branch, it effectively eliminates the background ghosting and dynamic blurring problems caused by the vehicle motion in traditional world models, and ensures the geometric stability and trajectory evolution accuracy in long-term prediction.
[0019] (3) This invention constructs a decision framework for uncertainty decoding and arbitration based on the predicted sparse features. By replacing the traditional ground-truth-dependent evaluation with quantified epistemic uncertainty, it adaptively resolves multimodal semantic conflicts, significantly improves the robustness and safety of 3D scene occupancy prediction under extreme conditions such as low light and sensor occlusion, and provides more reliable technical support for the perception and decision-making of autonomous driving systems in complex dynamic environments.
Brief Description of the Drawings
[0020] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a flowchart of the 3D semantic occupancy prediction method based on multimodal fusion provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of the framework of the 3D semantic occupancy prediction method based on multimodal fusion provided in an embodiment of the present invention;
Figure 3 is a schematic diagram of the data preprocessing and feature extraction process provided in an embodiment of the present invention;
Figure 4 is a schematic diagram of the prediction process of the dynamic-static decoupled world model provided in an embodiment of the present invention;
Figure 5 is a schematic diagram of the uncertainty arbitration decision-making process provided in an embodiment of the present invention;
Figure 6 is a schematic diagram of the structure of the 3D semantic occupancy prediction system based on multimodal fusion provided in an embodiment of the present invention.
Detailed Implementation
[0022] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of protection of this invention.
[0023] The inventive concept of this invention is as follows: Addressing the problems of incomplete single-modal perception information, high computational resource consumption of dense voxels, fuzzy dynamic scene prediction, and unreliable fusion decisions due to a lack of ground truth in existing 3D environment perception and prediction technologies, this invention provides a 3D semantic occupancy prediction method and system based on multimodal fusion. First, a sparse voxel encoding mechanism is used to map multimodal data to a unified 3D space, significantly reducing computational redundancy. Second, a dynamic-static separation dual-stream world model is designed, performing deterministic rigid body transformation and probabilistic trajectory prediction on the static background and dynamic foreground respectively, eliminating scene ghosting caused by vehicle motion. Finally, an uncertainty decoding and arbitration mechanism based on evidence theory is introduced to achieve reliable decision-making on multimodal prediction conflicts without the need for ground truth, outputting a spatially consistent and semantically accurate 3D semantic occupancy map with quantifiable confidence, thereby achieving high-precision prediction of future scene evolution and providing reliable and dynamic environmental understanding information for autonomous driving decision-making and planning.
[0024] As shown in Figure 1 and Figure 2, this embodiment provides a 3D semantic occupancy prediction method based on multimodal fusion, including the following steps: S1 maps camera images and LiDAR point clouds to a unified 3D space and performs sparse voxel encoding preprocessing based on a sparse hash index to obtain sparse multimodal data.
[0025] In this embodiment, this step aims to establish a unified 3D spatial representation by spatiotemporally aligning heterogeneous multimodal data, namely 2D visual data (RGB images) and 3D point cloud data (LiDAR point clouds), through depth estimation and point cloud voxelization. Unlike traditional methods that construct a full dense voxel grid, this step introduces a sparse voxel encoding strategy. Through hash mapping, only the coordinate indices and feature vectors of non-empty voxels are extracted. That is, only the "active regions" of the physical environment are retained in the hash index, generating sparse visual frustum features from the RGB images and voxel geometric features from the LiDAR point clouds. This avoids redundant computation on empty space, thereby significantly reducing memory usage while preserving high-frequency details.
[0026] As shown in Figure 3, the specific implementation process is detailed below: S1.1, Constructing visual frustum features based on depth distribution. First, the N (6 in this example) surround-view RGB images acquired by the vehicle's multi-camera system are normalized and input into a shared-weight 2D backbone network and Feature Pyramid Network (FPN) to extract multi-scale 2D image features with a downsampling factor of s (8 in this example). Then, a lightweight depth estimation head predicts the discrete depth probability distribution of each image feature point along the frustum direction. Combined with the camera intrinsic matrix and the extrinsic matrix relative to the LiDAR coordinate system, an outer product operation is performed, i.e., the 2D features are multiplied by the depth probability, to generate the visual frustum features. The depth probability $P_{depth}$ is expressed as: $$P_{depth} = \mathrm{Softmax}\left(\mathrm{Conv}_{d}(F_{2D})\right),$$ where $\mathrm{Conv}_{d}$ denotes the depth prediction convolutional layers; $\mathrm{Softmax}$ transforms the depth response values output by the convolutional layers into a probability distribution; and $F_{2D}$ denotes the extracted multi-scale 2D image features.
[0027] The three-dimensional coordinates $p_{u,v,k}$ of each feature point of the visual frustum features $F_{frustum}$ are calculated as follows: $$p_{u,v,k} = R\left(d_k\,K^{-1}\,[u, v, 1]^{\top}\right) + t,$$ where $(u, v)$ are the pixel coordinates and $[u, v, 1]^{\top}$ is the corresponding homogeneous coordinate vector; $d_k$ is the center depth value of the $k$-th depth interval of the depth probability distribution; $K$ is the camera intrinsic parameter matrix; $R$ and $t$ are the rotation matrix and translation vector of the camera relative to the LiDAR, respectively; and the superscript $\top$ denotes transposition. Calculating the three-dimensional coordinates of each view frustum feature point effectively diffuses the 2D semantic information to its possible 3D spatial locations, and the result is stored in tensor form in $F_{frustum}$.
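As a minimal sketch of S1.1, the lifting of 2D features into a view frustum can be written in Python with NumPy as below. The function name `lift_to_frustum`, the array layouts, and the convention that the pixel coordinate of a feature cell is its index multiplied by the downsampling factor are assumptions made for illustration; the extrinsics `R` and `t` are assumed to map camera coordinates into the LiDAR frame.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift_to_frustum(feat_2d, depth_logits, depth_centers, K, R, t, stride=8):
    """Lift image features into a frustum of 3D points.

    feat_2d:       (C, H, W) image features
    depth_logits:  (D, H, W) raw outputs of the depth head
    depth_centers: (D,) center depth of each depth bin
    Returns frustum features (D, H, W, C), depth probabilities (D, H, W),
    and LiDAR-frame point coordinates (D, H, W, 3).
    """
    _, H, W = feat_2d.shape
    depth_prob = softmax(depth_logits, axis=0)
    # Outer product: every pixel feature is weighted by every depth-bin probability.
    frustum_feat = depth_prob[..., None] * feat_2d.transpose(1, 2, 0)[None]

    # Homogeneous pixel coordinates [u, v, 1] in the full-resolution image.
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u * float(stride), v * float(stride),
                    np.ones((H, W))], axis=-1)
    rays = pix @ np.linalg.inv(K).T                       # K^-1 [u, v, 1]^T per pixel
    cam_pts = depth_centers[:, None, None, None] * rays[None]
    lidar_pts = cam_pts @ R.T + t                         # camera frame -> LiDAR frame
    return frustum_feat, depth_prob, lidar_pts
```

Summing the frustum features over the depth axis recovers the original 2D feature at each pixel, which is a quick sanity check that the depth distribution only redistributes, rather than rescales, the semantic information.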
[0028] S1.2, acquire the raw point cloud data collected by the LiDAR. First, the vehicle's high-frequency odometry data is used to perform timestamp-based motion compensation on the LiDAR point cloud, eliminating the point cloud distortion caused by the vehicle's high-speed movement, and noise points outside the preset region of interest (ROI) are removed. Then, the voxel resolution is set to $(v_x, v_y, v_z)$, where $v_x$, $v_y$, and $v_z$ are the physical lengths of a voxel along the X, Y, and Z axes, respectively. For any valid point $p_i = (x_i, y_i, z_i, r_i)$ in the motion-compensated and denoised point cloud (where $r_i$ is the reflection intensity), the coordinates $(g_x, g_y, g_z)$ of its corresponding voxel grid cell are calculated as follows: $$g_x = \left\lfloor \frac{x_i - x_{min}}{v_x} \right\rfloor,\quad g_y = \left\lfloor \frac{y_i - y_{min}}{v_y} \right\rfloor,\quad g_z = \left\lfloor \frac{z_i - z_{min}}{v_z} \right\rfloor,$$ where $(x_i, y_i, z_i)$ are the 3D coordinates of the point, $(x_{min}, y_{min}, z_{min})$ are the starting coordinates of the region of interest, and $\lfloor\cdot\rfloor$ is the rounding-down operation. Based on the grid coordinates, the physical coordinates of the voxel geometric center are derived as $c_x = x_{min} + (g_x + 0.5)\,v_x$ (similarly for the Y and Z axes); the relative position offset of the valid point with respect to the voxel geometric center is then calculated as: $$\Delta p_i = (x_i - c_x,\; y_i - c_y,\; z_i - c_z).$$ Finally, the 3D coordinates, reflection intensity, and relative position offset of the original point are concatenated along the feature channel dimension to generate the expanded point feature $\hat{p}_i = [x_i, y_i, z_i, r_i, \Delta p_i]$. This completes the mapping from discrete point data to enhanced point features that carry information about the local relative geometric distribution within each voxel.
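The voxel assignment and feature expansion of S1.2 can be sketched as follows, assuming the points have already been motion compensated and cropped to the ROI. The function name `expand_point_features` and the (N, 4) input layout are illustrative assumptions.

```python
import numpy as np

def expand_point_features(points, roi_min, voxel_size):
    """Assign points to voxels and append their offset to the voxel center.

    points:     (N, 4) array of x, y, z, intensity
    roi_min:    (3,) starting coordinates of the region of interest
    voxel_size: (3,) voxel edge lengths along X, Y, Z
    Returns integer grid coordinates (N, 3) and expanded features (N, 7).
    """
    xyz = points[:, :3]
    grid = np.floor((xyz - roi_min) / voxel_size).astype(np.int64)
    centers = roi_min + (grid + 0.5) * voxel_size   # geometric center of each voxel
    offsets = xyz - centers                          # relative position offset
    feats = np.concatenate([points, offsets], axis=1)
    return grid, feats
```

By construction each offset component is bounded by half the voxel edge length, so the expanded feature encodes where a point sits inside its voxel independently of the voxel's absolute position.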
[0029] S1.3, Construct and optimize the sparse hash index. To address the significant memory consumption caused by dense grids, this invention does not construct the entire 3D grid but instead establishes a coordinate hash table. For each non-empty voxel coordinate calculated in S1.2, a linear congruential or XOR hash function is used to map it to a unique hash key value, which is stored in the hash table. For multiple points falling into the same voxel, this step introduces a voxel feature encoding layer. A multilayer perceptron (MLP) with shared weights performs high-dimensional mapping on each point within the voxel to extract local geometric features, and max pooling is then applied to generate a voxel geometric feature that can describe the local surface curvature. This feature extraction process is expressed as follows: $$F_{geo}^{v} = \mathrm{MaxPool}_{i \in v}\left(\mathrm{MLP}(\hat{p}_i)\right),$$ where $\hat{p}_i$ is an expanded point feature in voxel $v$, $\mathrm{MLP}$ is the multilayer perceptron network, and $\mathrm{MaxPool}$ is the max pooling operation.
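A rough sketch of S1.3 is given below. For simplicity it uses a collision-free linearized grid index as the hash key and a Python dictionary as the hash table, which stands in for the linear congruential or XOR hash named above; the two-layer perceptron weights (`w1`, `b1`, `w2`, `b2`) and the function names are likewise hypothetical.

```python
import numpy as np

def voxel_key(grid, dims):
    """Collision-free integer key for grid coordinates inside a dims-sized ROI."""
    return (grid[:, 0] * dims[1] + grid[:, 1]) * dims[2] + grid[:, 2]

def encode_voxels(grid, point_feats, dims, w1, b1, w2, b2):
    """Build the sparse index and max-pool per-point MLP features per voxel.

    grid:        (N, 3) integer voxel coordinates of each point
    point_feats: (N, C_in) expanded point features
    Returns a dict {hash key -> row} and voxel features (M, C_out).
    """
    keys = voxel_key(grid, dims)
    hidden = np.maximum(point_feats @ w1 + b1, 0.0)    # shared-weight MLP, ReLU
    per_point = hidden @ w2 + b2

    uniq, inverse = np.unique(keys, return_inverse=True)
    voxel_feats = np.full((len(uniq), per_point.shape[1]), -np.inf)
    np.maximum.at(voxel_feats, inverse, per_point)     # max pooling within each voxel
    table = {int(k): row for row, k in enumerate(uniq)}
    return table, voxel_feats
```

Only voxels that contain at least one point receive a row, so memory grows with the number of occupied voxels rather than with the grid volume.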
[0030] S1.4, Sparse Soft Alignment and Tensor Construction of Multimodal Features. Using the sparse hash table constructed in S1.3 as the global index base, the view frustum feature points generated in S1.1 are traversed. Each view frustum feature point contains not only its three-dimensional coordinates $p_{u,v,k}$ but also the corresponding depth probability value $P_{depth}$ predicted in S1.1 (this value represents geometric existence and is distinct from the semantic category probability in the subsequent step S4). The 3D coordinates calculated for each view frustum feature point are looked up in the hash table. If the coordinates fall within an existing sparse voxel index, the visual frustum feature and the voxel geometric feature are concatenated to obtain a fused feature vector as the sparse multimodal data; if the coordinates fall within an empty region but the depth probability $P_{depth}$ of the visual frustum feature exceeds a preset threshold (set to 0.4 in this example), the voxel index is dynamically added to the hash table and the visual frustum feature is entered. The final output is a sparse feature tensor $S = \{C, F\}$, where $C$ is the list of coordinate indices of the $M$ active voxels and $F$ contains the corresponding fused feature vectors. This data structure strictly limits the computation of subsequent steps to the non-empty regions, enabling the system to process large-scale 3D scenes with complexity linear in the number of active voxels.
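The lookup-or-insert logic of S1.4 might look like the sketch below. The text does not say how several frustum points that land in the same voxel are combined or what geometric feature a newly added voxel receives; summing the visual features and zero-filling the geometric part are assumptions, as are the function name `soft_align` and the use of precomputed integer voxel keys for the frustum points.

```python
import numpy as np

def soft_align(table, voxel_feats, cam_keys, cam_feats, cam_prob, thresh=0.4):
    """Fuse frustum features into the sparse voxel index.

    table:       dict {voxel key -> row} from the LiDAR branch
    voxel_feats: (M, Cg) voxel geometric features
    cam_keys:    (P,) voxel key of each frustum point
    cam_feats:   (P, Cc) frustum features
    cam_prob:    (P,) depth probability of each frustum point
    Returns voxel keys (M',) and fused features (M', Cg + Cc).
    """
    table = dict(table)                      # leave the caller's index untouched
    geo = [row for row in voxel_feats]
    cam_dim = cam_feats.shape[1]
    cam = [np.zeros(cam_dim) for _ in geo]

    for key, feat, prob in zip(cam_keys.tolist(), cam_feats, cam_prob.tolist()):
        row = table.get(key)
        if row is None:
            if prob <= thresh:
                continue                     # unconfident point in empty space: drop
            row = len(geo)                   # confident point: activate a new voxel
            table[key] = row
            geo.append(np.zeros(voxel_feats.shape[1]))
            cam.append(np.zeros(cam_dim))
        cam[row] = cam[row] + feat           # assumed aggregation: sum per voxel

    fused = np.concatenate([np.stack(geo), np.stack(cam)], axis=1)
    keys = np.array(sorted(table, key=table.get))
    return keys, fused
```

Note that in this sketch a voxel activated by one confident point will also accumulate later low-probability points with the same key, so the result depends on traversal order; a production implementation would likely filter by probability first.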
[0031] S2 utilizes a cross-modal attention mechanism to extract and fuse features from sparse multimodal data, constructing a unified multimodal sparse voxel feature.
[0032] In this embodiment, this step aims to address the deep fusion problem of multimodal data in the feature space. Utilizing the high-precision 3D coordinates of LiDAR point clouds as geometric anchors, the most relevant texture semantics are dynamically retrieved and aggregated from visual frustum features. A cross-modal attention mechanism is then used to inject the semantic features of RGB images into the sparse geometric skeleton of the LiDAR point cloud, generating a unified multimodal sparse voxel feature that combines geometric accuracy with semantic richness. This feature contains both the high-precision geometric information of the LiDAR point cloud and the rich semantic texture of the RGB image, serving as the initial input state for the world model.
[0033] The specific implementation process is detailed as follows: S2.1, Constructing Sparse Queries and Key-Value Pairs. First, the voxel geometric features from the sparse multimodal data output by S1 are input into a self-attention layer to extract their local contextual relationships. Sinusoidal positional encoding is then added to preserve the absolute spatial location information of the voxels, generating the query vector $Q$. Simultaneously, the visual frustum features in the sparse multimodal data output by S1 are flattened and used as the key vector $K$ and the value vector $V$. Furthermore, to improve retrieval efficiency, this invention selects only the visual feature points within a certain radius (2.0 meters in this example) of the LiDAR voxel center for calculation, thereby achieving localized feature association. The construction of the query vector $Q_i$, key vector $K_j$, and value vector $V_j$ can be represented as: $$Q_i = \left(F_{geo}^{i} + \mathrm{PE}(c_i)\right) W_Q,\quad K_j = F_{frustum}^{j}\, W_K,\quad V_j = F_{frustum}^{j}\, W_V,\quad j \in \mathcal{N}_i,$$ where $F_{geo}^{i}$ is the geometric feature of the $i$-th LiDAR voxel after the self-attention layer and $c_i$ is its coordinate; $\mathrm{PE}$ denotes the positional encoding operation; $\mathcal{N}_i$ is the set of indices of the visual frustum features $F_{frustum}^{j}$ that fall within the geometric neighborhood of the LiDAR voxel; and $W_Q$, $W_K$, $W_V$ are the linear projection weight matrices for the query, key, and value, respectively, used to map the features to the attention space.
[0034] S2.2, Multi-head Sparse Cross-Attention Aggregation. Using a multi-head attention mechanism, the geometrically precisely positioned sparse voxels from the LiDAR actively retrieve the RGB visual features most relevant to them. Specifically, for the $i$-th sparse voxel from the LiDAR point cloud data, its geometric feature is used as the query vector, and its dot-product similarity with the key vectors generated from the visual frustum features of the RGB images is calculated. The resulting attention weights are normalized with Softmax and used to form a weighted sum of the value vectors of the visual frustum features, thereby aggregating the corresponding visual semantic information into the LiDAR voxel. The fused feature $F_{att}^{i}$ of the sparse voxel is calculated as follows: $$F_{att}^{i} = \sum_{h=1}^{H} \mathrm{Softmax}\left(\frac{Q_{i}^{h}\,(K_{\mathcal{N}_i}^{h})^{\top}}{\sqrt{d}}\right) V_{\mathcal{N}_i}^{h}\, W_{h}^{O},$$ where $H$ is the number of attention heads; $W_{h}^{O}$ is the output projection matrix of the $h$-th attention head; $Q_{i}^{h}$ is the geometric feature query vector of the $i$-th LiDAR sparse voxel in the $h$-th attention head; $K_{\mathcal{N}_i}^{h}$ and $V_{\mathcal{N}_i}^{h}$ are, for the $h$-th attention head, the sets of key vectors and value vectors of the RGB visual frustum features falling within the neighborhood index set $\mathcal{N}_i$; $\sqrt{d}$ is the scaling factor for the feature dimension, used to prevent gradient vanishing caused by excessively large dot-product values; and the superscript $\top$ denotes transposition. This process aggregates the RGB visual semantic information into the LiDAR geometric features to obtain the fused feature $F_{att}^{i}$.
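The radius-limited multi-head cross-attention of S2.1 and S2.2 can be illustrated with the following NumPy sketch. It assumes the query features already include the self-attention context and positional encoding, uses a row-vector convention, and concatenates the heads before a single output projection `Wo`, which is equivalent to summing per-head output projections. All names are hypothetical, and voxels with no visual feature inside the radius simply receive a zero vector.

```python
import numpy as np

def sparse_cross_attention(q_feats, q_xyz, kv_feats, kv_xyz,
                           Wq, Wk, Wv, Wo, n_heads, radius=2.0):
    """Each LiDAR voxel attends only to frustum points within `radius` meters.

    q_feats:  (M, Cq) voxel geometric features, q_xyz:  (M, 3) voxel centers
    kv_feats: (P, Cv) frustum features,         kv_xyz: (P, 3) frustum points
    Returns aggregated visual features (M, d_model).
    """
    M, d_model = q_feats.shape[0], Wq.shape[1]
    d_head = d_model // n_heads
    Q = (q_feats @ Wq).reshape(M, n_heads, d_head)
    K = (kv_feats @ Wk).reshape(-1, n_heads, d_head)
    V = (kv_feats @ Wv).reshape(-1, n_heads, d_head)

    out = np.zeros((M, d_model))
    for i in range(M):
        dist = np.linalg.norm(kv_xyz - q_xyz[i], axis=1)
        nbr = np.flatnonzero(dist <= radius)              # neighborhood index set
        if nbr.size == 0:
            continue                                      # nothing to aggregate
        scores = np.einsum("hd,nhd->hn", Q[i], K[nbr]) / np.sqrt(d_head)
        scores -= scores.max(axis=1, keepdims=True)       # stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        heads = np.einsum("hn,nhd->hd", weights, V[nbr])  # weighted sum of values
        out[i] = heads.reshape(-1) @ Wo                   # concat heads, project
    return out
```

The explicit loop keeps the neighborhood logic readable; a practical implementation would batch the neighbor search, for example with a spatial grid or k-d tree.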
[0035] S2.3, Adaptive Gated Fusion. Considering that visual features may fail in low-light or textureless regions, this invention introduces a learnable gating coefficient. The attention-aggregated feature $F_{att}^{i}$ is concatenated with the original LiDAR voxel geometric feature $F_{geo}^{i}$ and input into a multilayer perceptron to predict the gating weight, and the final fusion is then achieved through a residual connection. The final multimodal sparse voxel feature $F_{fused}^{i}$ is calculated as follows: $$g_i = \sigma\left(\mathrm{MLP}\left([F_{att}^{i};\, F_{geo}^{i}]\right)\right),\quad F_{fused}^{i} = F_{geo}^{i} + g_i \odot F_{att}^{i},$$ where $g_i$ is the gating coefficient, $\sigma$ is the Sigmoid activation function, $[\cdot\,;\cdot]$ denotes concatenation, and $\odot$ denotes element-wise multiplication. This mechanism allows the model to dynamically adjust the contribution ratio of geometric and visual information based on ambient lighting and texture richness.
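A minimal sketch of the gated fusion in S2.3 follows. It assumes the residual form "geometry plus gated visual feature" and replaces the multilayer perceptron with a single linear layer (`w_gate`, `b_gate`) to keep the example short; both choices and the function name are assumptions.

```python
import numpy as np

def gated_fusion(geo, att, w_gate, b_gate):
    """Blend attention-aggregated visual features into geometric features.

    geo: (M, C) LiDAR voxel geometric features
    att: (M, C) attention-aggregated visual features
    w_gate: (2C, C), b_gate: (C,) parameters of the (single-layer) gate network
    Returns fused features (M, C) and the gate values (M, C).
    """
    gate_in = np.concatenate([att, geo], axis=1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate + b_gate)))   # sigmoid in (0, 1)
    fused = geo + gate * att                                    # residual connection
    return fused, gate
```

When the gate saturates near zero, for instance in a dark or textureless region, the output falls back to the LiDAR geometry alone, which is the behavior the gating is meant to provide.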
[0036] S3 inputs multimodal sparse voxel features into a dynamic-static separation dual-flow world model. The static flow uses the vehicle motion transformation matrix to perform deterministic rigid body coordinate transformation prediction, while the dynamic flow uses a neural network to perform probabilistic motion trajectory evolution prediction. Finally, the two prediction results are recombined in a unified coordinate system to generate sparse features for future time prediction.
[0037] In this embodiment, this step employs a dual-stream decoupled prediction architecture. Prior knowledge is used to explicitly separate the scene into a static flow influenced by the vehicle's motion and a dynamic flow influenced by the objects' own motion. These are modeled separately and then recombined to address the background ghosting problem generated by traditional world models. Specifically, the static flow uses the vehicle's motion matrix for deterministic rigid body coordinate transformation prediction, while the dynamic flow uses a neural network for probabilistic trajectory evolution prediction. Finally, the two prediction results are recombined in a unified coordinate system to generate sparse features for future timeframes.
[0038] As shown in Figure 4, the specific implementation process is detailed below: S3.1, Semantic-guided decoupling of dynamic and static elements. A lightweight semantic segmentation head coarsely classifies the sparse features at the current time step, and a binary mask is defined according to the physical attributes of the semantic labels: vehicles, pedestrians, etc. are labeled as dynamic, while roads, buildings, etc. are labeled as static. Based on this mask, the input feature tensor $F^{t}$ (the multimodal sparse voxel features at the current time step $t$) is decomposed, along either the channel or the index dimension, into the voxel features $F_s^{t}$ belonging to the static background and the voxel features $F_d^{t}$ belonging to the dynamic foreground. The splitting process is expressed as follows:

$$F_s^{t} = F^{t} \odot (1 - M), \qquad F_d^{t} = F^{t} \odot M$$

where $M$ is the dynamic binary mask generated from the semantic categories.
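The mask-based split in S3.1 can be sketched as a partition along the index dimension. The dictionary layout (voxel coordinate mapped to feature), the label names in `DYNAMIC_CLASSES`, and the function name `split_dynamic_static` are assumptions for illustration; the coarse labels are taken as given rather than produced by a segmentation head.

```python
# Assumed label names; the source only lists vehicles and pedestrians as examples.
DYNAMIC_CLASSES = {"vehicle", "pedestrian"}

def split_dynamic_static(voxels, labels):
    """Partition sparse voxels into static background and dynamic foreground.

    voxels: dict mapping (i, j, k) voxel index -> feature list
    labels: dict mapping (i, j, k) voxel index -> coarse class name
    """
    static, dynamic = {}, {}
    for coord, feat in voxels.items():
        is_dynamic = labels[coord] in DYNAMIC_CLASSES  # binary mask value
        (dynamic if is_dynamic else static)[coord] = feat
    return static, dynamic

voxels = {(0, 0, 0): [0.1], (4, 2, 0): [0.9], (7, 1, 1): [0.5]}
labels = {(0, 0, 0): "road", (4, 2, 0): "vehicle", (7, 1, 1): "building"}
static, dynamic = split_dynamic_static(voxels, labels)
print(sorted(static), sorted(dynamic))
```

Splitting by index keeps both outputs sparse, whereas multiplying by the mask along the channel dimension would keep the full index list and zero out the masked entries.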
[0039] S3.2, Deterministic Rigid Body Transformation of Static Flow. For the static background flow, the position change relative to the ego vehicle is entirely determined by the vehicle's own motion. The vehicle motion transformation matrix provided by the vehicle's CAN bus or IMU is obtained, a rigid body transformation is applied to the coordinate indices of all static voxels, and trilinear interpolation is used to remap the transformed features back to the nearest hash grid nodes. The static voxel coordinates $C_s^{t+1}$ and features $F_s^{t+1}$ at the next time step are calculated as follows:

$$C_s^{t+1} = T_{t \to t+1}\, C_s^{t}, \qquad F_s^{t+1} = \mathrm{Interp}\bigl(F_s^{t},\, C_s^{t+1}\bigr)$$

where $T_{t \to t+1}$ is the transformation matrix that transforms coordinates from the vehicle coordinate system at time $t$ to the vehicle coordinate system at time $t+1$; $C_s^{t}$ and $F_s^{t}$ are the coordinates and features of the static voxels at the current moment; and $\mathrm{Interp}(\cdot)$ denotes the trilinear interpolation sampling operation. The final output is the static flow prediction result $\{C_s^{t+1}, F_s^{t+1}\}$, containing the new coordinates and features. This step ensures the structural stability of the background prediction through explicit geometric transformation and feature resampling.
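A minimal sketch of the static flow in S3.2 is given below. It applies a 4x4 homogeneous ego-motion matrix to voxel coordinates and snaps each result to the nearest grid node. Note the simplification: nearest-node assignment is used here in place of the trilinear interpolation named in the text, to keep the example short. The function name `transform_static`, the dictionary layout, and the `voxel_size` parameter are hypothetical.

```python
import math

def transform_static(static, ego_motion, voxel_size=1.0):
    """Move static voxels from the vehicle frame at t to the frame at t+1.

    static: dict mapping (i, j, k) voxel index -> feature list
    ego_motion: 4x4 homogeneous matrix as a list of rows (frame t -> frame t+1)
    """
    moved = {}
    for (i, j, k), feat in static.items():
        x, y, z = i * voxel_size, j * voxel_size, k * voxel_size
        # Rigid transform: rotation block times point plus translation column.
        p = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in ego_motion[:3]]
        # Nearest grid node (stand-in for trilinear resampling).
        key = tuple(math.floor(v / voxel_size + 0.5) for v in p)
        moved[key] = feat  # if two voxels land on one node, the later one wins
    return moved

# The vehicle drives 2 m forward along x, so static points shift by -2 m
# when expressed in the new vehicle frame.
ego_motion = [
    [1.0, 0.0, 0.0, -2.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
moved = transform_static({(5, 1, 0): [0.3, 0.7]}, ego_motion)
print(moved)
```

Because this branch has no learned parameters, the background cannot drift or blur over successive prediction steps, which is the motivation the text gives for separating it from the dynamic flow.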
[0040] S3.3, Probabilistic Trajectory Evolution of Dynamic Flow. For the dynamic foreground flow, a sparse temporal Transformer model is used to capture the motion trend of objects. The input is the sequence of dynamic features from historical frames $\{F_d^{\tau}\}_{\tau = t-k}^{t}$, where $F_d^{\tau}$ is the dynamic foreground sparse voxel features at time $\tau$ (i.e., the moving-object features retained after semantic masking) and $k$ is the time span of the historical frames. The model predicts the displacement vector and feature residual of each dynamic voxel at the next time step, thereby simulating nonlinear motion behaviors such as vehicle acceleration and turning. The prediction process is expressed as follows:

$$(\Delta p,\ \Delta F) = \mathrm{Transformer}\bigl(\{F_d^{\tau}\}_{\tau = t-k}^{t}\bigr), \qquad C_d^{t+1} = C_d^{t} + \Delta p, \qquad F_d^{t+1} = F_d^{t} + \Delta F$$

where $\Delta p$ is the predicted three-dimensional displacement vector, $\Delta F$ is the predicted feature evolution residual, $C_d^{t+1}$ is the predicted coordinates of the dynamic voxels at the next time step, and $F_d^{t+1}$ is the predicted features of the dynamic voxels at the next time step (used for decoding and arbitration in the subsequent step S4). By explicitly modeling the joint evolution of position and features, this step can simulate nonlinear behaviors such as vehicle acceleration, turning, and changes in appearance.
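The update applied after the temporal model in S3.3 can be sketched as follows. The sparse temporal Transformer itself is out of scope here: its outputs (per-voxel displacement and feature residual) are passed in as plain dictionaries, which is an assumption made for illustration, as is rounding the displaced position to the nearest voxel index. The function name `advance_dynamic` is hypothetical.

```python
import math

def advance_dynamic(dynamic, displacement, residual):
    """Apply predicted displacement and feature residual to dynamic voxels.

    dynamic: dict mapping (i, j, k) voxel index -> feature list
    displacement: dict mapping the same index -> (dx, dy, dz) in voxel units
    residual: dict mapping the same index -> feature residual list
    """
    advanced = {}
    for coord, feat in dynamic.items():
        # Position update, snapped to the nearest voxel index.
        new_coord = tuple(
            math.floor(c + d + 0.5) for c, d in zip(coord, displacement[coord])
        )
        # Feature update: current feature plus predicted residual.
        advanced[new_coord] = [f + r for f, r in zip(feat, residual[coord])]
    return advanced

dynamic = {(1, 1, 0): [1.0, 2.0]}
displacement = {(1, 1, 0): (1.2, 0.0, 0.0)}  # stand-in for the model output
residual = {(1, 1, 0): [0.5, -1.0]}          # stand-in for the model output
advanced = advance_dynamic(dynamic, displacement, residual)
print(advanced)
```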
[0041] S3.4, Sparse Feature Recombination and Collision Handling. The predicted static and dynamic flows are merged in the same hash table. Let the remapped static sparse features output by S3.2 be $\{C_s^{t+1}, F_s^{t+1}\}$ and the evolved dynamic sparse features output by S3.3 be $\{C_d^{t+1}, F_d^{t+1}\}$. First, the global coordinate index set $C^{t+1}$ for the future moment is constructed as the union of the static and dynamic coordinate indices:

$$C^{t+1} = C_s^{t+1} \cup C_d^{t+1}$$

For each coordinate index $c$ in $C^{t+1}$, its corresponding fused feature $F^{t+1}(c)$ is determined. If a static voxel and a dynamic voxel overlap at the same spatial location (i.e., $c \in C_s^{t+1} \cap C_d^{t+1}$), the dynamic-priority principle is followed and the dynamic feature is preserved directly, simulating the physical phenomenon of the foreground occluding the background; if there is no overlap, the feature of whichever flow occupies that location is preserved. The feature merging logic of the final predicted sparse tensor for the future time is as follows:

$$F^{t+1}(c) = \begin{cases} F_d^{t+1}(c), & c \in C_d^{t+1} \\ F_s^{t+1}(c), & c \notin C_d^{t+1} \end{cases}$$

Through these set operations and the conditional selection mechanism, this step resolves the spatial occlusion and feature conflicts caused by the coupling of vehicle motion and object motion during prediction, and generates a predicted sparse tensor containing complete environmental information.
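The recombination rule in S3.4 maps naturally onto a hash-table merge. In the sketch below, a Python `dict` stands in for the sparse hash table (an assumption for illustration), and the function name `merge_flows` is hypothetical.

```python
def merge_flows(static, dynamic):
    """Union of static and dynamic voxels with dynamic priority on collisions.

    Both arguments map (i, j, k) voxel index -> feature list.
    """
    merged = dict(static)   # start from the background
    merged.update(dynamic)  # foreground overwrites background where they overlap
    return merged

static = {(3, 1, 0): ["road"], (4, 1, 0): ["road"]}
dynamic = {(4, 1, 0): ["car"], (5, 1, 0): ["car"]}
merged = merge_flows(static, dynamic)
print(merged)
```

The key set of the result is the union of both index sets, and the only overlapping index, `(4, 1, 0)`, keeps the dynamic feature.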
[0042] S4: Uncertainty decoding and arbitration are performed based on the predicted sparse features to generate the final 3D semantic occupancy map.
[0043] In this embodiment, this step quantifies the semantic confidence and cognitive uncertainty of each voxel based on evidence theory, dynamically evaluating prediction quality without requiring ground truth. When the predictions of the RGB and LiDAR branches conflict, a semantic arbitration mechanism is activated, integrating geometric uncertainty and semantic evidence to generate the final 3D semantic occupancy map and ensuring the robustness of the fusion result under complex conditions.
[0044] The specific implementation process is detailed as follows: S4.1, Evidence prediction based on Dirichlet distribution. The predicted sparse features are input into the decoder, which does not pass them through a Softmax layer to output normalized class probabilities directly; instead, it outputs a non-negative evidence vector. Assuming the class probabilities follow a Dirichlet distribution, the predicted probability of the $k$-th class is the expected value of that distribution. The relationship between the class evidence vector $e$ and the predicted probability $p_k$ is as follows:

$$p_k = \frac{e_k + 1}{\sum_{j=1}^{K} (e_j + 1)}$$

where $e_k$ is the amount of observational evidence supporting the $k$-th class, $K$ is the total number of semantic categories, and $j$ is an index over the semantic categories.
[0045] S4.2, Quantification of Cognitive Uncertainty. Based on evidence theory, the cognitive (epistemic) uncertainty of each voxel is calculated. This metric reflects the degree of "ignorance" of the model caused by a lack of evidence, such as when it faces an unseen obstacle or a completely occluded area. The cognitive uncertainty $u$ is calculated as follows:

$$u = \frac{K}{\sum_{k=1}^{K} (e_k + 1)}$$

where the denominator $S = \sum_{k=1}^{K} (e_k + 1)$ represents the total strength of the Dirichlet distribution. The greater the total amount of evidence, the closer $u$ is to 0; conversely, when evidence is scarce, $u$ is close to 1.
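The evidence decoding of S4.1 and the uncertainty of S4.2 can be sketched together. The sketch assumes the common evidential formulation in which each Dirichlet parameter equals the class evidence plus one; the function name `dirichlet_decode` is hypothetical.

```python
def dirichlet_decode(evidence):
    """Expected class probabilities and cognitive uncertainty for one voxel.

    evidence: list of non-negative evidence values, one per semantic class
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]  # Dirichlet parameters (assumed e_k + 1)
    strength = sum(alpha)                # total strength S
    probs = [a / strength for a in alpha]
    uncertainty = num_classes / strength
    return probs, uncertainty

# Strong evidence for class 0: confident prediction, low uncertainty.
probs, u = dirichlet_decode([9.0, 0.0, 0.0])
print(probs, u)

# No evidence at all: uniform prediction, maximal uncertainty.
probs_empty, u_empty = dirichlet_decode([0.0, 0.0, 0.0])
print(probs_empty, u_empty)
```

With evidence `[9, 0, 0]` the total strength is 12, so the uncertainty is 3/12 = 0.25; with no evidence the strength equals the number of classes and the uncertainty is exactly 1.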
[0046] S4.3, Multimodal Semantic Conflict Arbitration. As shown in Figure 5, independent evidence predictions are performed on the predicted sparse features of the RGB branch and on the predicted sparse features of the LiDAR branch, yielding two sets of uncertainties ($u^{rgb}$, $u^{lidar}$) and evidence quantities ($e^{rgb}$, $e^{lidar}$). $u^{rgb}$ and $u^{lidar}$ are the semantically guided second uncertainty value and the geometrically guided first uncertainty value, respectively; $e^{rgb}$ and $e^{lidar}$ are the semantically guided second evidence quantity and the geometrically guided first evidence quantity, respectively, with the evidence quantities represented in vector form. Based on this, preliminary semantic prediction labels are first generated for each branch:

$$\hat{y}^{rgb} = \arg\max_{k}\, e_k^{rgb}, \qquad \hat{y}^{lidar} = \arg\max_{k}\, e_k^{lidar}$$

where $\hat{y}^{rgb}$ is the semantic category (such as vehicle, road, etc.) predicted by the RGB branch and $\hat{y}^{lidar}$ is the semantic category predicted by the LiDAR branch. When the predicted categories for the same voxel differ, an arbitration mechanism is triggered. If the conflict is geometric, i.e., the two branches disagree on whether the voxel is "occupied" or "free", the result with the lower geometric uncertainty is prioritized; if the conflict is semantic, i.e., both branches classify the voxel as "occupied" but assign different categories, the result with the higher total semantic evidence is prioritized, thus combining the advantages of both modalities to resolve the conflict. The arbitration logic is formalized as follows:

$$y^{*} = \begin{cases} \hat{y}^{rgb}, & \hat{y}^{rgb} = \hat{y}^{lidar} \\ \hat{y}^{m_u},\ \ m_u = \arg\min_{m \in \{rgb,\ lidar\}} u^{m}, & \text{occupancy conflict} \\ \hat{y}^{m_E},\ \ m_E = \arg\max_{m \in \{rgb,\ lidar\}} E^{m}, & \text{semantic conflict} \end{cases}$$

where $E^{rgb} = \sum_k e_k^{rgb}$ and $E^{lidar} = \sum_k e_k^{lidar}$ are the total amounts of evidence in the RGB and LiDAR branches, and $y^{*}$ is the semantic category after arbitration.
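The two-level arbitration in S4.3 can be sketched for a single voxel as below. Several details are assumptions for illustration: class index 0 is taken to mean "free", each branch's label is the class with the most evidence, the uncertainty uses the evidence-plus-one Dirichlet form, and ties fall back to the LiDAR branch. The names `arbitrate`, `uncertainty`, and `argmax` are hypothetical.

```python
FREE = 0  # assumed convention: class index 0 means the voxel is unoccupied

def uncertainty(evidence):
    """Cognitive uncertainty K / sum(e_k + 1) of one evidence vector."""
    return len(evidence) / sum(e + 1.0 for e in evidence)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def arbitrate(e_rgb, e_lidar):
    """Resolve the per-voxel label from two branch evidence vectors."""
    y_rgb, y_lidar = argmax(e_rgb), argmax(e_lidar)
    if y_rgb == y_lidar:
        return y_rgb  # no conflict
    if (y_rgb == FREE) != (y_lidar == FREE):
        # Occupancy conflict: trust the branch with lower uncertainty.
        return y_rgb if uncertainty(e_rgb) < uncertainty(e_lidar) else y_lidar
    # Semantic conflict: both occupied, trust the branch with more total evidence.
    return y_rgb if sum(e_rgb) > sum(e_lidar) else y_lidar

# Occupancy conflict: camera weakly says "free", LiDAR strongly says class 1.
occupancy_case = arbitrate([1.0, 0.0, 0.0], [0.0, 8.0, 0.0])
# Semantic conflict: camera has more total evidence, so its class 1 wins.
semantic_case = arbitrate([0.0, 6.0, 1.0], [0.0, 1.0, 3.0])
print(occupancy_case, semantic_case)
```

The split mirrors the rationale in the text: whether space is occupied is primarily a geometric question, while the category of an occupied voxel is primarily a semantic one.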
[0047] S4.4, Final Occupancy Map Generation. Based on the final semantic category $y^{*}$ output by the arbitration in step S4.3, the expected probability $p_{y^{*}}$ corresponding to this category is calculated using the Dirichlet distribution expectation formula of step S4.1 and is used to quantify the confidence level of the classification. Specifically, an effective confidence threshold $\tau$ is set (0.4 in this example). If $p_{y^{*}} < \tau$, this indicates that although the system currently tends toward category $y^{*}$, the evidence is insufficient to form a reliable decision, and the voxel is marked as "unknown" or its previous state is retained to reduce the false detection rate; if $p_{y^{*}} \ge \tau$, $y^{*}$ is confirmed as the final semantic label. The final output 3D semantic occupancy map not only contains the semantic category and occupancy state defined by $y^{*}$, but also carries the confidence distribution corresponding to $p$ and the uncertainty variance map corresponding to $u$, which can provide an intuitive risk assessment basis for the downstream path planning module, ensuring the safety and robustness of the system.
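The confidence check in S4.4 can be sketched as follows, using the threshold of 0.4 given in the text. The `UNKNOWN` sentinel, the function name `finalize_label`, and the choice to treat a confidence exactly equal to the threshold as sufficient are assumptions for illustration; retaining the previous state instead of marking the voxel unknown is not shown.

```python
UNKNOWN = -1  # assumed sentinel for voxels without a reliable decision

def finalize_label(evidence, label, threshold=0.4):
    """Keep the arbitrated label only if its expected probability is high enough.

    evidence: evidence vector of the branch selected by arbitration
    label: class index chosen by arbitration
    """
    strength = sum(e + 1.0 for e in evidence)
    confidence = (evidence[label] + 1.0) / strength  # Dirichlet expectation
    final = label if confidence >= threshold else UNKNOWN
    return final, confidence

# Enough evidence: (1 + 1) / 4 = 0.5, which clears the 0.4 threshold.
kept, kept_conf = finalize_label([1.0, 0.0, 0.0], label=0)
# Too little evidence over four classes: 1.2 / 4.2 is below the threshold.
dropped, dropped_conf = finalize_label([0.2, 0.0, 0.0, 0.0], label=0)
print(kept, kept_conf, dropped, dropped_conf)
```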
[0048] Based on the same inventive concept, as shown in Figure 6, this embodiment of the invention also provides a 3D semantic occupancy prediction system 600 based on multimodal fusion, including: a multimodal sparse coding module 610, a cross-modal feature fusion module 620, a dual-stream world model prediction module 630, and an uncertainty decoding arbitration module 640.
[0049] The multimodal sparse coding module 610 is used to map camera images and LiDAR point clouds to a unified three-dimensional space and perform sparse voxel coding preprocessing based on a sparse hash index to obtain sparse multimodal data.
[0050] The cross-modal feature fusion module 620 is used to extract and fuse features from sparse multimodal data using a cross-modal attention mechanism, and construct a unified multimodal sparse voxel feature.
[0051] The dual-stream world model prediction module 630 is used to input multimodal sparse voxel features into a dynamic-static separation dual-stream world model. The static stream uses the vehicle motion transformation matrix to perform deterministic rigid body coordinate transformation prediction, while the dynamic stream uses a neural network to perform probabilistic motion trajectory evolution prediction. Finally, the two prediction results are recombined in a unified coordinate system to generate the predicted sparse features for future times.
[0052] The uncertainty decoding arbitration module 640 is used to perform uncertainty decoding and arbitration based on predicted sparse features to generate the final 3D semantic occupancy map.
[0053] Based on the same inventive concept, embodiments of the present invention also provide an electronic device, including a memory and one or more processors, wherein the memory is used to store a computer program, and the processor is used to implement the above-described 3D semantic occupancy prediction method based on multimodal fusion when executing the computer program.
[0054] Based on the same inventive concept, embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which, when executed by a computer, implements the above-described 3D semantic occupancy prediction method based on multimodal fusion.
[0055] It should be noted that the 3D semantic occupancy prediction system, electronic device, and computer-readable storage medium based on multimodal fusion provided in the above embodiments all belong to the same inventive concept as the 3D semantic occupancy prediction method based on multimodal fusion. For details of their specific implementation process, please refer to the embodiments of the 3D semantic occupancy prediction method based on multimodal fusion, which will not be repeated here.
[0056] The specific embodiments described above illustrate the technical solution and beneficial effects of the present invention in detail. It should be understood that the above description is only the most preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, additions, and equivalent substitutions made within the scope of the principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A 3D semantic occupancy prediction method based on multimodal fusion, characterized in that it includes the following steps: mapping camera images and LiDAR point clouds to a unified 3D space and performing sparse voxel encoding preprocessing based on a sparse hash index to obtain sparse multimodal data; using a cross-modal attention mechanism to extract and fuse features from the sparse multimodal data, constructing unified multimodal sparse voxel features; inputting the multimodal sparse voxel features into a dynamic-static separation dual-stream world model, in which the static flow uses the vehicle motion transformation matrix to perform deterministic rigid body coordinate transformation prediction and the dynamic flow uses a neural network to perform probabilistic motion trajectory evolution prediction, and finally recombining the two prediction results in a unified coordinate system to generate predicted sparse features for future times; and performing uncertainty decoding and arbitration based on the predicted sparse features to generate the final 3D semantic occupancy map.
2. The 3D semantic occupancy prediction method based on multimodal fusion according to claim 1, characterized in that mapping camera images and LiDAR point clouds to a unified 3D space and performing sparse voxel encoding preprocessing based on a sparse hash index to obtain sparse multimodal data includes: extracting multi-scale 2D features from the camera images, predicting the discrete depth probability distribution of each feature point, and lifting the 2D features into three-dimensional space using the camera parameters to generate visual frustum features; performing motion compensation and denoising on the LiDAR point cloud, assigning each point to the corresponding voxel grid cell according to a preset resolution, and calculating its relative position offset to augment the point features; constructing a sparse hash index that stores the hash keys of the non-empty voxel coordinates, and using a neural network to extract and aggregate voxel geometric features from the multiple points within each non-empty voxel; and aligning and fusing the visual frustum features and the voxel geometric features in the same three-dimensional space defined by the sparse hash index to generate the sparse multimodal data.
3. The 3D semantic occupancy prediction method based on multimodal fusion according to claim 2, characterized in that, during alignment and fusion, the voxel geometric features of the LiDAR point cloud and the visual frustum features of the camera image are combined by channel concatenation or dynamic addition based on the coordinate alignment results to obtain fused feature vectors, yielding a sparse feature tensor that contains a list of coordinate indices and the corresponding fused feature vectors and is used as the sparse multimodal data.
4. The 3D semantic occupancy prediction method based on multimodal fusion according to claim 2, characterized in that using a cross-modal attention mechanism to extract and fuse features from the sparse multimodal data and construct unified multimodal sparse voxel features includes: using the voxel geometric features in the sparse multimodal data as the query vectors, and flattening the visual frustum features in the sparse multimodal data as the key vectors and value vectors; performing multi-head cross-attention computation based on the query vectors, key vectors, and value vectors to aggregate visual semantic information into the geometric features; and introducing learnable gating coefficients to adaptively weight and fuse the aggregated features with the voxel geometric features in the original sparse multimodal data to generate the multimodal sparse voxel features.
5. The 3D semantic occupancy prediction method based on multimodal fusion according to claim 1, characterized in that the static flow using the vehicle motion transformation matrix to perform deterministic rigid body coordinate transformation prediction includes: separating the voxel features belonging to the static background based on the semantic classification results of the multimodal sparse voxel features at the current moment; obtaining the motion transformation matrix of the vehicle between the current moment and the next moment; and performing a rigid body transformation on the coordinate indices of the static background voxels using the vehicle motion transformation matrix, and mapping the corresponding static background voxel features to the transformed coordinate positions through interpolation.
6. The 3D semantic occupancy prediction method based on multimodal fusion according to claim 1, characterized in that the dynamic flow using a neural network to perform probabilistic motion trajectory evolution prediction includes: separating the voxel features belonging to the dynamic foreground based on the semantic classification results of the multimodal sparse voxel features at the current moment; inputting the dynamic foreground voxel features of the historical frames into a temporal neural network to predict the displacement vector and feature evolution residual of each dynamic voxel at the next time step; and updating the coordinates of the dynamic voxels based on the displacement vector, and updating the corresponding dynamic foreground voxel features using the feature evolution residual.
7. The 3D semantic occupancy prediction method based on multimodal fusion according to claim 1, characterized in that performing uncertainty decoding and arbitration based on the predicted sparse features to generate the final 3D semantic occupancy map includes: generating, based on the predicted sparse features and through parallel decoding branches, a first evidence quantity guided by geometry and a second evidence quantity guided by semantics, and calculating the cognitive uncertainty corresponding to each evidence vector; for each voxel, comparing the geometric occupancy state prediction and the semantic category prediction obtained from the first evidence quantity and the second evidence quantity, wherein when the two evidence vectors are inconsistent in determining whether the voxel is occupied, occupancy state arbitration is triggered and the occupancy state corresponding to the vector with the lower cognitive uncertainty is adopted, and when both determine that the voxel is occupied but the semantic categories differ, semantic category arbitration is triggered and the semantic category corresponding to the vector with the higher total amount of semantic evidence is adopted; and determining the final semantic category and occupancy state of each voxel based on the evidence vector after arbitration, and generating a 3D semantic occupancy map with uncertainty information.
8. A 3D semantic occupancy prediction system based on multimodal fusion, implemented using the 3D semantic occupancy prediction method based on multimodal fusion as described in any one of claims 1 to 7, characterized in that it includes: a multimodal sparse coding module, a cross-modal feature fusion module, a dual-stream world model prediction module, and an uncertainty decoding arbitration module; the multimodal sparse coding module is used to map camera images and LiDAR point clouds to a unified three-dimensional space and perform sparse voxel coding preprocessing based on a sparse hash index to obtain sparse multimodal data; the cross-modal feature fusion module is used to extract and fuse features from the sparse multimodal data using a cross-modal attention mechanism and construct unified multimodal sparse voxel features; the dual-stream world model prediction module is used to input the multimodal sparse voxel features into a dynamic-static separation dual-stream world model, in which the static flow uses the vehicle motion transformation matrix to perform deterministic rigid body coordinate transformation prediction and the dynamic flow uses a neural network to perform probabilistic motion trajectory evolution prediction, and finally the two prediction results are recombined in a unified coordinate system to generate predicted sparse features for future times; and the uncertainty decoding arbitration module is used to perform uncertainty decoding and arbitration based on the predicted sparse features to generate the final 3D semantic occupancy map.
9. An electronic device comprising a memory and one or more processors, the memory being used to store a computer program, characterized in that the processor, when executing the computer program, implements the 3D semantic occupancy prediction method based on multimodal fusion as described in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a computer, implements the 3D semantic occupancy prediction method based on multimodal fusion as described in any one of claims 1 to 7.