Method, device and medium for constructing and completing semantics of dynamic and static occupancy of scene based on multi-modal data

By constructing scene dynamic and static occupancy semantics using multimodal data, and utilizing dense static scene modeling and dynamic obstacle point cloud completion, the problems of inaccurate positioning and incomplete dynamic obstacle modeling in existing technologies are solved, achieving higher-precision occupancy semantic grid production and improving the safety of autonomous driving.

CN122223318A · Pending · Publication Date: 2026-06-16 · JISHU TECHNOLOGY (WUHAN) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: JISHU TECHNOLOGY (WUHAN) CO LTD
Filing Date: 2026-03-03
Publication Date: 2026-06-16

IPC: G06V10/26; G06V20/70

AI Tagging

Application Domain

Character and pattern recognition

AI Technical Summary

⚠Technical Problem

Existing technologies suffer from inaccurate positioning during the production of occupancy semantic grids, leading to blurred static scenes and incomplete modeling of dynamic obstacles, which affects the accuracy of occupancy semantic grids.

⚗Method used

Multimodal data is used for scene modeling. Dense static scene modeling and dynamic obstacle point cloud completion are combined with a point cloud segmentation model and a diffusion model to generate accurate occupancy semantic grid ground truth.

🎯Benefits of technology

It achieves higher precision in static scene modeling and dynamic obstacle modeling, generates more accurate occupancy semantic raster ground truth, and improves the safety of autonomous driving.

Abstract

The application relates to the field of image segmentation, and discloses a scene dynamic-static occupancy semantic construction and completion method and device based on multi-modal data and a medium, which comprises the following steps: scene modeling based on a dense static scene to obtain complete static point cloud semantic labels; for point cloud data not belonging to the static scene, dynamic obstacle point cloud completion is carried out to restore complete obstacle point cloud; and the complete static point cloud and the complete obstacle point cloud are superimposed to generate an accurate occupancy semantic grid true value. The application optimizes static scene modeling and dynamic obstacle modeling, and can realize higher-precision occupancy semantic grid true value production.

Description

Technical Field

[0001] This invention relates to the field of image segmentation, and in particular to a method, device, and medium for constructing and completing scene dynamic and static occupancy semantics based on multimodal data.

Background Technology

[0002] Occupancy semantic grids have significant application value in autonomous driving tasks, serving as a bridge between perception and decision-making. They transform sensor data into rich spatial semantic representations, enabling vehicles to understand the scene and plan their paths more safely. Recent research on occupancy semantic grids has been extensive and is largely based on deep learning, so high-quality ground truth annotations for occupancy semantic grids are crucial. Existing ground truth generation processes for occupancy semantic grids are primarily based on LiDAR sequences and generally include the following steps: dynamic object detection (or segmentation), multi-frame overlay and fusion, point cloud semantic segmentation, scene modeling, and point cloud voxelization.

[0003] For static scene modeling, this process relies on high-precision positioning results to calculate inter-frame transformation relationships and perform point cloud frame stacking. Static scene modeling is achieved through multi-frame observation. If the positioning is inaccurate, the static scene will appear blurry and layered. Currently, mainstream positioning is often based on GPS+INS, which can have poor positioning accuracy in scenarios such as elevated roads and tunnels, leading to jumps during frame stacking and layering between frames. Additionally, cumulative errors can also cause blurring after frame stacking.

[0004] For dynamic obstacle modeling, representation is often limited to the point cloud of a single frame or a superposition of several adjacent frames, which limits how completely dynamic obstacles are represented. Firstly, LiDAR scans are incomplete: because of the relative motion between the object and the vehicle, dynamic objects in the scene often cannot be fully scanned. Secondly, the image and LiDAR observation perspectives differ. Existing mainstream occupancy semantic grid prediction networks are often based on panoramic images. For the same object, differences in sensor installation location and sensor observation attributes lead to discrepancies between the image and LiDAR observations, resulting in inaccurate ground truth. Thirdly, the limited scanning range of LiDAR makes it difficult to detect or segment distant or small objects. These three issues restrict the quality of occupancy grids produced for dynamic obstacles.

Summary of the Invention

[0005] The purpose of this invention is to propose a method, device, and medium for constructing and completing scene dynamic and static occupancy semantics based on multimodal data, thereby solving the technical problem that existing technologies are not accurate in the occupancy semantics production process.

[0006] Specifically, this invention provides a method for constructing and completing scene dynamic and static occupancy semantics based on multimodal data, including the following steps: S1. Perform scene modeling based on dense static scenes to obtain complete static point cloud semantic labels; S2. For point cloud data that is not static, perform dynamic obstacle point cloud completion to restore the complete obstacle point cloud; S3. Overlay the complete static point cloud with the complete point cloud of obstacles to generate accurate occupancy semantic raster ground truth.

[0007] A storage device that stores instructions and data for implementing a method for constructing and completing scene dynamic and static occupancy semantics based on multimodal data.

[0008] A scene dynamic and static occupancy semantic construction and completion device based on multimodal data includes: a processor and a storage device; the processor loads and executes instructions and data in the storage device to implement a scene dynamic and static occupancy semantic construction and completion method based on multimodal data.

[0009] The beneficial effects provided by this invention are: static scene modeling and dynamic obstacle modeling are optimized, enabling higher-precision production of occupancy semantic grid ground truth.

Attached Figure Description

[0010] Figure 1 is a simplified schematic diagram of the method flow of the present invention; Figure 2 is a schematic diagram of the hardware device used in this application.

Detailed Implementation

[0011] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be further described below with reference to the accompanying drawings.

[0012] Before formally describing the present invention, a general description of the solution of the present invention will be given first to facilitate understanding.

[0013] Example 1: Referring to Figure 1, which is a schematic diagram of the method flow of the present invention, this invention provides a method for constructing and completing scene dynamic and static occupancy semantics based on multimodal data, comprising the following steps: S1. Perform scene modeling based on dense static scenes to obtain complete static point cloud semantic labels. Step S1 is as follows: S11. Use the single-frame point cloud after precise positioning to recall the corresponding dense static scene map. It should be noted in advance that the input to the method of this invention mainly consists of a sequence of LiDAR point cloud frames with synchronized timestamps collected by the autonomous vehicle, and the high-precision pose (position and attitude) of each point cloud frame in the global coordinate system obtained through a fusion positioning system. In addition, a dense static scene map needs to be pre-constructed for the scene to be labeled. This map is generated by collecting multiple frames of point clouds of the scene under high-precision positioning (such as LiDAR SLAM) when there are no dynamic obstacles or when dynamic obstacles have been removed, and then fusing and denoising them. It is a 3D point cloud or semantic map with centimeter-level geometric accuracy that contains only static elements (such as road surfaces, curbs, buildings, fixed facilities, and vegetation). This map serves as the ground truth reference for the static scene.

[0014] More specifically, dense static scene maps are static, obstacle-free 3D point cloud maps or semantic maps pre-constructed using LiDAR SLAM technology, multi-sensor fusion mapping, or other high-precision surveying methods. They are generated by collecting data from the same scene when there are no vehicles or very few dynamic objects, and then performing high-precision stitching and denoising of the multi-frame point clouds. Their storage format can be PCD, PLY, etc., and they possess centimeter-level or higher geometric accuracy together with labeled (or associable) static semantic information (such as roads, buildings, and trees).

[0015] To implement this method, the required input data includes: Core perception data: time-synchronized LiDAR point cloud sequences and surround / forward camera images. Images are primarily used to assist in semantic segmentation of the point clouds (as described in S12 below).

[0016] Auxiliary sensing data: Millimeter-wave radar point clouds can be used to assist in the initial detection of dynamic obstacles in low-visibility or long-range scenarios, providing more robust target candidates for step S2. This data is optional.

[0017] Positioning data: Vehicle pose information provided by a high-precision fusion positioning system (such as one combining laser SLAM, GNSS, and IMU). This positioning information needs to meet certain accuracy requirements to ensure that the single-frame point cloud can accurately recall the static map in step S11. Specifically, "precise positioning" means that the pose estimation error of the single-frame point cloud relative to the global coordinate system should typically be less than 20 cm in translation and less than 1 degree in rotation.
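The accuracy requirement above can be illustrated with a small check, given as a minimal sketch only. It assumes poses are 4x4 homogeneous matrices and that a reference pose is available for comparison (for example from a surveyed trajectory); the function names `pose_error` and `is_precise` are hypothetical and do not appear in the patent.

```python
import numpy as np

def pose_error(T_est, T_ref):
    """Return (translation error in meters, rotation error in degrees) between two 4x4 poses."""
    delta = np.linalg.inv(T_ref) @ T_est
    trans_err = np.linalg.norm(delta[:3, 3])
    # Rotation angle of the residual rotation, from its trace.
    cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return trans_err, np.degrees(np.arccos(cos_angle))

def is_precise(T_est, T_ref, max_trans_m=0.20, max_rot_deg=1.0):
    """Thresholds follow the text: under 20 cm translation and under 1 degree rotation."""
    trans_err, rot_err = pose_error(T_est, T_ref)
    return bool(trans_err < max_trans_m and rot_err < max_rot_deg)

def yaw_pose(yaw_deg, translation):
    """Helper for the example: pose with a rotation about z and a translation."""
    a = np.radians(yaw_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T
```

A frame whose pose fails this check would not be a reliable key for recalling the static map in S11.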

[0018] S12. After change detection and dynamic element removal, the single-frame point cloud is fused with the dense static scene map to obtain a complete semantic label for the single-frame static point cloud. It should be noted that this invention performs unified rasterization processing on the real-time acquired single-frame point cloud and the pre-established ground truth 3D map, and then performs voxel-level difference comparison (or difference operation) to efficiently identify and remove matching areas in the static environment, thereby accurately extracting newly added or missing changed objects in the environment.

[0019] The change detection described above is achieved through voxelized differencing. Specifically, the current frame point cloud and the recalled dense static scene map are uniformly divided into a 3D voxel grid with a side length of L (e.g., L = 0.1 meters). For each voxel, the number of points of the current frame point cloud within that voxel and the number of points of the static map within that voxel are counted, and the rate of change of points is calculated from these two counts. If the rate of change is greater than a preset threshold (for example, 0.4), the voxel is marked as a "change region". In order to filter out false detections caused by the brief stay of dynamic objects or by sensor noise, a voxel is finally determined to be a real change in the static environment (e.g., the addition of an obstacle), and removed from the point cloud used for static modeling in the current frame, only when it is marked as a change in M consecutive frames (e.g., M = 3).
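The voxel differencing and M-frame confirmation can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the exact rate-of-change formula is not legible in this text, so the sketch assumes the relative count difference |n_cur - n_map| / max(n_cur, n_map, 1); the grid origin, grid shape, and all function names are hypothetical.

```python
import numpy as np

def voxel_counts(points, voxel_size, origin, shape):
    """Count points per voxel on a fixed grid; points outside the grid are ignored."""
    idx = np.floor((points - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(shape)), axis=1)
    counts = np.zeros(shape, dtype=int)
    np.add.at(counts, tuple(idx[inside].T), 1)
    return counts

def change_mask(frame_pts, map_pts, voxel_size=0.1, origin=(0.0, 0.0, 0.0),
                shape=(4, 4, 4), tau=0.4):
    """Mark voxels whose assumed rate of change exceeds the threshold tau."""
    origin = np.asarray(origin, dtype=float)
    n_cur = voxel_counts(frame_pts, voxel_size, origin, shape)
    n_map = voxel_counts(map_pts, voxel_size, origin, shape)
    # ASSUMPTION: relative difference of point counts stands in for the patent's formula.
    rate = np.abs(n_cur - n_map) / np.maximum(np.maximum(n_cur, n_map), 1)
    return rate > tau

def confirmed_changes(masks, m=3):
    """A voxel is confirmed only if it was flagged in each of the last m frames."""
    if len(masks) < m:
        return np.zeros_like(masks[0])
    return np.all(np.stack(masks[-m:]), axis=0)
```

Calling `change_mask` once per frame and feeding the resulting list to `confirmed_changes` reproduces the two-step logic: per-frame thresholding, then temporal confirmation.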

[0020] It should be noted that step S12 specifically employs a point cloud segmentation model to segment local point clouds in the dense static scene map, automatically obtaining static scene semantic labels. The point cloud segmentation model uses the SAM (Segment Anything Model) model. This invention uses points as prompts to automatically segment static scene semantics, mainly including buildings, trees, shrubs, signs, and poles. These are then fused with the semantics of the regions of the static base map that show no change to produce complete static semantics.

[0021] Specifically, in step S12, the point cloud segmentation model can adopt an image-based SAM model, and the specific application method is as follows: First, using the pre-calibrated extrinsic parameter matrix between the vehicle-mounted LiDAR and the camera, the current frame point cloud is projected onto a time-synchronized panoramic image, generating sparse depth points or projection point sets associated with image pixels. Then, on the image, projection points belonging to static objects (such as building facades and tree trunks) are selected as prompts and input into the SAM model to obtain segmentation masks for these static objects on the image. Finally, through the inverse projection of the camera imaging model, these image masks are mapped back to 3D space, assigning static semantic labels to the corresponding points in the original point cloud. This process realizes semantic annotation of 3D point clouds using a 2D image segmentation model.
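The projection and label transfer described above can be sketched as follows, assuming a pinhole camera with intrinsic matrix `K` and a 4x4 LiDAR-to-camera extrinsic matrix `T_cam_lidar`. Both names, and the function names, are hypothetical. The segmentation mask is taken as given (it would come from SAM); no SAM call is made here. Instead of explicitly back-projecting the mask, the sketch looks up the mask value at each point's projected pixel, which assigns the same labels.

```python
import numpy as np

def project_points(points_lidar, T_cam_lidar, K):
    """Project Nx3 LiDAR points to pixel coordinates; valid marks points in front of the camera."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    valid = pts_cam[:, 2] > 1e-6
    uv = np.full((len(points_lidar), 2), -1.0)
    proj = (K @ pts_cam[valid].T).T
    uv[valid] = proj[:, :2] / proj[:, 2:3]
    return uv, valid

def label_points_from_mask(points_lidar, T_cam_lidar, K, mask, label, labels=None):
    """Assign `label` to every point whose projection falls on a True pixel of `mask`."""
    h, w = mask.shape
    uv, valid = project_points(points_lidar, T_cam_lidar, K)
    labels = np.full(len(points_lidar), -1) if labels is None else labels.copy()
    px = np.round(uv).astype(int)
    in_img = valid & (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
    hit = np.zeros(len(points_lidar), dtype=bool)
    hit[in_img] = mask[px[in_img, 1], px[in_img, 0]]
    labels[hit] = label
    return labels
```

The same `project_points` output can also supply the point prompts: projected pixels of points known to be static are the candidates passed to the segmentation model.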

[0022] S13. Repeat steps S11 to S12 for other frame point clouds to obtain complete static point cloud semantic labels.

[0023] S2. For point cloud data that is not static, perform dynamic obstacle point cloud completion to restore the complete obstacle point cloud. Since change detection and dynamic element removal were performed in step S12, this step completes the dynamic elements that were removed.

[0024] Step S2 is as follows: S21. Construct a vehicle template library. It should be noted that the vehicle template library in step S21 is constructed in advance by scanning vehicles of different shapes and sizes through 360°, including cars, SUVs, pickup trucks, trucks, and buses.

[0025] Specifically, the point clouds in the template library all contain vehicles scanned from 360 degrees, ensuring complete vehicle modeling. The template library includes vehicles of different brands and types, such as sedans, SUVs, pickup trucks, trucks, and buses, covering common vehicles in traffic scenarios.

[0026] In this invention, the vehicle template library is constructed offline through the following steps: (1) Using a high-precision 3D scanning device, various typical vehicles (sedans, SUVs, etc.) are scanned 360° to obtain complete and dense point clouds of the vehicle's exterior surface; (2) The scanned point clouds are denoised, downsampled, and normalized (e.g., translated to the origin of the coordinate system and unified the principal axis direction); (3) The processed standard complete vehicle point clouds are stored in the template library according to their categories. Each template is an independent point cloud file, and records the approximate size information of its original vehicle.
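The normalization in step (2) can be sketched as follows. This is a minimal illustration under assumptions the patent does not state: the template is centered on its centroid, the principal axis is estimated by PCA on the ground-plane (x-y) footprint, and the "approximate size" is the axis-aligned extent after alignment. The function name is hypothetical; denoising and downsampling are omitted.

```python
import numpy as np

def normalize_template(points):
    """Center a scanned vehicle and rotate it about z so its longest footprint axis lies along x."""
    centered = points - points.mean(axis=0)
    # PCA on the x-y footprint: the eigenvector with the largest eigenvalue is the length direction.
    cov = np.cov(centered[:, :2].T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]
    yaw = np.arctan2(major[1], major[0])
    c, s = np.cos(-yaw), np.sin(-yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    aligned = centered @ R.T
    size = aligned.max(axis=0) - aligned.min(axis=0)  # approximate length, width, height
    return aligned, size
```

PCA leaves the front/back direction ambiguous (the principal axis has no sign), so a real template library would need an extra convention to fix the heading.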

[0027] S22. A diffusion model is used to complete the incomplete point cloud data of vehicles that are not static, resulting in a completed point cloud. The diffusion model integrates PointMAE and LION, and achieves high-precision completion of incomplete point clouds through a two-stage structure of "local completion - latent space diffusion reconstruction". It should be noted that this invention draws on the design concepts of LION: Latent Point Diffusion Models for 3D Shape Generation (hereinafter referred to as LION) and PointMAE: Masked Autoencoders for Point Cloud Self-supervised Learning (hereinafter referred to as PointMAE). LION maps 3D point clouds to a latent feature space comprising an overall shape space and a point structure space; denoising diffusion models (DDMs) are trained in this latent feature space, and the result is then decoded to recover point clouds or triangular meshes. PointMAE is a point cloud local reconstruction model based on masked autoencoders. In the first stage, PointMAE reconstructs the locally missing regions; in the second stage, LION performs latent space diffusion optimization on the completed point cloud, providing globally consistent shape priors, thereby obtaining a complete 3D point cloud with reasonable structure and fine local details.

[0028] Specifically, this step is achieved through a two-stage diffusion completion network that integrates the ideas of PointMAE and LION, ranging from local to global. The training data for this network consists of a large number of complete vehicle point clouds and locally incomplete vehicle point clouds generated by artificial simulation.

[0029] First stage (encoding and local reconstruction): After the input incomplete vehicle point cloud is normalized, it is fed into an encoder-decoder network based on the PointMAE architecture. This network learns through mask modeling and outputs a preliminary completed point cloud $P_{pre}$. It can repair local defects, but the overall shape may not be regular enough.

[0030] Second stage (latent space diffusion optimization): $P_{pre}$ is input to the encoder of a pre-trained LION model and mapped to a low-dimensional latent vector z. Subsequently, a denoising diffusion probabilistic model (DDPM) trained on the latent space of complete vehicle point clouds performs multi-step denoising optimization on z, generating an optimized latent vector $z_{opt}$ that conforms to the prior knowledge of complete vehicle shapes. Finally, the LION decoder decodes $z_{opt}$ to yield the final vehicle point cloud $P_{complete}$, which is complete, structurally sound, and detailed.

[0031] The two phases mentioned above require joint training.

[0032] When deploying the application, the following key parameters need to be set: PointMAE local reconstruction stage: after the input incomplete point cloud is normalized, the preliminary completed point cloud $P_{pre}$ is obtained directly through network forward propagation.

[0033] LION latent space optimization stage: after $P_{pre}$ is encoded into a latent vector z, sampling is performed using the pre-trained denoising diffusion model. The DDIM (Denoising Diffusion Implicit Models) sampler is used to balance efficiency and quality. The total number of denoising steps K is set to 25, and the sampling temperature parameter is 0.9. This optimization process yields $z_{opt}$, which is finally decoded into the complete point cloud $P_{complete}$.

[0034] Once the key parameters are set, training can begin. The specific process is as follows: a subset of vehicle classes from publicly available large-scale complete 3D shape datasets (such as ShapeNet or Pascal3D+), or a self-built complete vehicle point cloud dataset, is used as the training basis. For each complete vehicle point cloud model $P_{complete}$ (containing N points), the following preprocessing is performed to construct training pairs. Normalization: $P_{complete}$ is translated so that the center of its bounding box lies at the origin and is scaled to fit within the unit sphere. Simulated incompleteness: an incomplete point cloud $P_{partial}$ is generated using a randomized strategy to simulate occlusion in real LiDAR scanning. Random viewpoint occlusion: a depth map is rendered from a random viewpoint and back-projected to simulate single-view scanning. Random block masking: a certain percentage (e.g., 30%-70%) of continuous point clusters on the point cloud is randomly discarded. Noise addition: slight Gaussian noise is added to the point cloud to simulate sensor noise. This allows a large number of training sample pairs ($P_{partial}$, $P_{complete}$) to be constructed.
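The construction of training pairs can be sketched as follows. This is a simplified illustration: it covers normalization, random block masking, and Gaussian noise, and omits the random-viewpoint depth rendering. The "continuous point cluster" is approximated by the points nearest to a randomly chosen seed point, and the noise level and function names are assumptions rather than values from the patent.

```python
import numpy as np

def normalize_unit_sphere(points):
    """Move the bounding-box center to the origin and scale the cloud into the unit sphere."""
    center = (points.min(axis=0) + points.max(axis=0)) / 2.0
    centered = points - center
    return centered / np.linalg.norm(centered, axis=1).max()

def random_block_mask(points, drop_ratio, rng):
    """Drop the drop_ratio fraction of points closest to a random seed point (one contiguous block)."""
    seed = points[rng.integers(len(points))]
    order = np.argsort(np.linalg.norm(points - seed, axis=1))
    n_drop = int(round(drop_ratio * len(points)))
    return points[np.sort(order[n_drop:])]

def make_training_pair(p_complete, rng, drop_range=(0.3, 0.7), noise_std=0.005):
    """Return (P_partial, P_complete) with P_complete normalized and P_partial masked and noised."""
    p_complete = normalize_unit_sphere(p_complete)
    ratio = rng.uniform(*drop_range)
    p_partial = random_block_mask(p_complete, ratio, rng)
    p_partial = p_partial + rng.normal(0.0, noise_std, p_partial.shape)  # simulated sensor noise
    return p_partial, p_complete
```

Calling `make_training_pair` repeatedly on the same complete model with a shared `rng` yields different partial views, which is how a large number of pairs can be produced from a limited set of shapes.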

[0035] Phase 1: Pre-training of the PointMAE local reconstruction model.

[0036] A Transformer-based masked autoencoder architecture is employed. The input is an incomplete point cloud $P_{partial}$. The model learns to predict the coordinates of points in the masked (i.e., missing) regions. To do so, the normalized $P_{partial}$ is randomly masked at a high ratio (e.g., 60%). The encoder processes the visible points to generate latent features. The decoder reconstructs the coordinates of the masked points based on the latent features and the mask tokens.

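[0037] The loss function combines point-to-point distance and distribution similarity to supervise reconstruction quality, as detailed below:

$$L_{rec} = \lambda_{1} L_{CD} + \lambda_{2} L_{EMD}$$

[0038] where $L_{CD}$ is the chamfer distance, $L_{EMD}$ is the Earth Mover's Distance, and $\lambda_{1}$ and $\lambda_{2}$ are the balancing weights.

[0039] The chamfer distance between the predicted point set $P$ and the ground-truth point set $Q$ is calculated, in its standard form, as follows:

$$L_{CD}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \|p - q\|_2^2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \|q - p\|_2^2$$

[0040] The Earth Mover's Distance is calculated, in its standard form, as follows:

$$L_{EMD}(P, Q) = \min_{\phi: P \to Q} \frac{1}{|P|} \sum_{p \in P} \|p - \phi(p)\|_2$$

where $\phi$ is a bijective mapping.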
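For illustration, the two distances and their weighted sum can be computed as below. The sketch assumes the standard definitions (squared nearest-neighbor distances averaged in both directions for the chamfer distance, and the mean matched distance under the best bijection for the Earth Mover's Distance). The bijection is found by brute force over permutations, which is only feasible for a handful of points; practical training code uses an approximate solver. Function and weight names are hypothetical.

```python
import numpy as np
from itertools import permutations

def pairwise_dist(p, q):
    """Euclidean distance matrix between point sets p (N x 3) and q (M x 3)."""
    return np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)

def chamfer_distance(p, q):
    """Symmetric chamfer distance with squared nearest-neighbor distances."""
    d2 = pairwise_dist(p, q) ** 2
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def emd_bruteforce(p, q):
    """Earth Mover's Distance via exhaustive search over bijections (tiny sets only)."""
    assert len(p) == len(q), "a bijection needs equally sized sets"
    d = pairwise_dist(p, q)
    n = len(p)
    rows = np.arange(n)
    return min(d[rows, list(perm)].sum() for perm in permutations(range(n))) / n

def reconstruction_loss(p, q, lam_cd=1.0, lam_emd=1.0):
    """Weighted sum of chamfer distance and EMD."""
    return lam_cd * chamfer_distance(p, q) + lam_emd * emd_bruteforce(p, q)
```

The chamfer term penalizes points that lie far from the other set, while the EMD term also penalizes uneven point density, which is why the two are combined.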

[0041] Phase 2: Pre-training of the LION global diffusion model.

[0042] The LION model consists of a variational autoencoder (VAE) and a diffusion model trained on the VAE latent space. Its goal is to learn a smooth distribution of the complete vehicle point cloud in a low-dimensional latent space.

[0043] VAE training: Complete, normalized point clouds $P_{complete}$ are used as input. The encoder E maps $P_{complete}$ to a latent vector z, and the decoder G reconstructs z into a point cloud. The system is trained using a reconstruction loss (such as chamfer distance) and a KL divergence loss, so that the latent space z conforms to a standard normal distribution and can be decoded.

[0044] Training of the latent space diffusion model: The trained VAE encoder E is used to convert all complete training point clouds into latent vectors $z_0$.

[0045] A denoising diffusion probabilistic model (DDPM) is trained on the latent space z. This model learns a denoising process that gradually denoises random Gaussian noise $z_T$ to recover a latent vector $z_0$ that conforms to the distribution of real vehicle data.

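[0046] The training objective of the diffusion model is to minimize the noise prediction error:

$$L_{diff} = \mathbb{E}_{z_0, \epsilon, t}\left[\left\| \epsilon - \epsilon_\theta(z_t, t) \right\|_2^2\right]$$

[0047] where $t$ is the diffusion time step, $\epsilon$ is the added Gaussian noise, $z_t$ is the noised latent vector at step $t$, and $\epsilon_\theta$ is the denoising network to be trained.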
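The noise prediction objective can be illustrated numerically as follows. The sketch uses the standard DDPM forward process, in which a noised latent is formed as sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps. The linear beta schedule, the number of steps, and the callable `eps_model` are assumptions for illustration; the patent does not specify them, and a real denoising network would be a trained neural network rather than the stand-ins used in the example.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, eps, alpha_bar):
    """Forward diffusion: noise the latent vectors z0 (B x D) to their time steps t (B,)."""
    a = alpha_bar[t][:, None]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def noise_prediction_loss(eps_model, z0, alpha_bar, rng):
    """Monte Carlo estimate of E[ || eps - eps_model(z_t, t) ||^2 ] over one batch."""
    t = rng.integers(0, len(alpha_bar), size=len(z0))
    eps = rng.standard_normal(z0.shape)
    z_t = q_sample(z0, t, eps, alpha_bar)
    return np.mean(np.sum((eps - eps_model(z_t, t)) ** 2, axis=1))
```

In training, `z0` would be the latent vectors produced by the VAE encoder E, and the loss would be minimized with respect to the parameters of the denoising network.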

[0048] Joint training and fine-tuning: To enable the two-stage models to work together, end-to-end joint fine-tuning is performed after separate pre-training. Forward pipeline: the input $P_{partial}$ first passes through the PointMAE model, which outputs a preliminary completed point cloud $P_{pre}$. The VAE encoder of the LION model encodes $P_{pre}$ as a latent vector $z$. LION's diffusion model sampler (which acts as a latent space "optimizer") performs a small number of iterative denoising steps (e.g., 10 steps) on $z$ to obtain the optimized latent vector $z_{opt}$. LION's VAE decoder decodes $z_{opt}$ into the final output point cloud $P_{final}$.

[0049] Loss function and training: The goal of joint training is to make the final output point cloud $P_{final}$ as close as possible to the real, complete point cloud $P_{complete}$. The total loss function is:

$$L_{total} = \alpha L_{acc} + \beta L_{shape}$$

[0050] where $L_{acc}$ measures the accuracy of the final output and $L_{shape}$ is a shape constraint loss, which is calculated as follows: global features are extracted from both $P_{final}$ and $P_{complete}$ by a pre-trained, parameter-frozen 3D shape classification network (such as PointNet), and the cosine similarity loss of their feature vectors is calculated:

$$L_{shape} = 1 - \frac{f(P_{final}) \cdot f(P_{complete})}{\|f(P_{final})\|_2 \, \|f(P_{complete})\|_2}$$

[0051] Here, $f(\cdot)$ represents the feature extractor of the shape classification network, and $\alpha$ and $\beta$ are the balancing coefficients. At this stage, the main focus is on fine-tuning the PointMAE decoder and the LION VAE encoder / decoder, while the parameters of the diffusion model are typically kept frozen to maintain its learned strong shape priors.
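The shape constraint can be sketched as a cosine-similarity loss on global features. In the sketch below, `toy_global_feature` is a hypothetical stand-in for the frozen shape classification network (it is not PointNet), the loss is assumed to take the common form 1 - cos, and the weights `alpha` and `beta` and the `accuracy_loss` callable are placeholders for terms the text does not spell out.

```python
import numpy as np

def toy_global_feature(points):
    """Stand-in for f(.): permutation-invariant statistics of the cloud. NOT a trained network."""
    return np.concatenate([points.max(axis=0), points.min(axis=0), points.std(axis=0)])

def shape_constraint_loss(p_final, p_complete, f=toy_global_feature):
    """Assumed form of the cosine similarity loss: 1 - cos(f(P_final), f(P_complete))."""
    a, b = f(p_final), f(p_complete)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 1.0 - cos

def total_loss(p_final, p_complete, accuracy_loss, alpha=1.0, beta=0.1):
    """Weighted sum of an accuracy term and the shape constraint term."""
    return alpha * accuracy_loss(p_final, p_complete) + beta * shape_constraint_loss(p_final, p_complete)
```

Because the feature extractor is frozen, this term only steers the completion network toward outputs whose global shape descriptor matches that of the true vehicle.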

[0052] Through the aforementioned phased training strategy, the diffusion completion model can effectively learn general prior knowledge of vehicle shape and master the ability to repair any incomplete point cloud into a complete and reasonable vehicle model, thereby providing high-quality completed point clouds for subsequent template matching.

[0053] S23. Match the completed point cloud with the vehicle template library, and use the matching result as the complete point cloud of the vehicle. It should be noted that this invention uses the Iterative Closest Point (ICP) algorithm or a nearest neighbor search based on global descriptors (such as FPFH) to find the complete vehicle template most similar to the completed point cloud in the template library. The matching criterion is to minimize the distance error between point clouds.

[0054] Specifically, the Iterative Closest Point (ICP) algorithm is used for matching. The matching process iterates until convergence or the maximum number of iterations (e.g., 100) is reached. After matching is completed, the root mean square error (RMSE) of the final registration is calculated. If the RMSE is less than a preset threshold Δ (e.g., Δ = 0.08 meters), the matching is considered successful, and the matched template is used as the complete point cloud of the vehicle; otherwise, the matching is considered a failure, and a default category template that is closest to the current vehicle's bounding box size can be selected.
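The matching rule can be sketched with a small point-to-point ICP, given here as an illustration rather than the patent's implementation. It uses brute-force nearest neighbors and the Kabsch least-squares rigid fit; a production system would use a KD-tree and a point cloud library. The fallback to a default category template on failure is not implemented (the sketch returns `None` instead), and all function names are hypothetical.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Kabsch: least-squares rotation R and translation t such that dst ~= src @ R.T + t."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

def nn_rmse(src, dst):
    """Nearest neighbor in dst for every src point, and the RMSE of those distances."""
    d = np.linalg.norm(src[:, None] - dst[None], axis=2)
    nn = d.argmin(axis=1)
    return nn, np.sqrt(np.mean(d[np.arange(len(src)), nn] ** 2))

def icp_rmse(src, dst, max_iter=100, tol=1e-9):
    """Run ICP until the RMSE stops changing or max_iter is reached; return the final RMSE."""
    cur = src.copy()
    nn, rmse = nn_rmse(cur, dst)
    for _ in range(max_iter):
        R, t = best_rigid_transform(cur, dst[nn])
        cur = cur @ R.T + t
        nn, new_rmse = nn_rmse(cur, dst)
        converged = abs(rmse - new_rmse) < tol
        rmse = new_rmse
        if converged:
            break
    return rmse

def match_template(completed, templates, delta=0.08):
    """Return (best template name or None if RMSE >= delta, best RMSE)."""
    scores = {name: icp_rmse(completed, pts) for name, pts in templates.items()}
    best = min(scores, key=scores.get)
    return (best if scores[best] < delta else None), scores[best]
```

Here `templates` is a dictionary mapping a template name to its point array, and `delta` plays the role of the RMSE threshold Δ.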

[0055] S24. Adjust the complete point cloud of the vehicle according to its size and orientation.

[0056] It should be noted that the adjustment described in step S24 is as follows: the vehicle is placed in the corresponding position in the local point cloud map according to its center point position.

[0057] Specifically, first, the three-dimensional oriented bounding box (OBB) is calculated from the original dynamic obstacle point cloud cluster to obtain the vehicle's current position, orientation (yaw angle), and approximate size. Then, the matched complete point cloud template is scaled (to match the size) and rotated (to match the orientation) accordingly. Finally, it is translated so that the center of the template coincides with the center of the original point cloud cluster, thus obtaining a complete vehicle point cloud that fits the actual scene position, size, and orientation.
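The scale, rotate, and translate sequence can be sketched as follows. The sketch assumes the template is already normalized (centered at the origin with its length along the x axis), that the bounding-box center, yaw, and size are already known, and that scaling is applied per axis; the patent only states that the template is scaled to match the size. The function name is hypothetical.

```python
import numpy as np

def place_template(template, template_size, center, yaw, size):
    """Scale a normalized template to `size`, rotate it by `yaw` about z, and move it to `center`."""
    scale = np.asarray(size, dtype=float) / np.asarray(template_size, dtype=float)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return (template * scale) @ R.T + np.asarray(center, dtype=float)
```

The order matters: scaling is applied in the template's own axes before rotation, so length, width, and height are matched independently of the vehicle's heading.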

[0058] S3. Overlay the complete static point cloud with the complete point cloud of obstacles to generate accurate occupancy semantic raster ground truth.

[0059] It should be noted that step S3 is as follows: The complete static point cloud is overlaid with the complete point cloud of the obstacle, aligned and voxelized to generate a raster map with semantics. Visual analysis is performed on the raster map to obtain the final accurate scene dynamic and static occupancy semantic raster ground truth labeling results.

[0060] Specifically, the voxelization process employs a uniform 3D grid with a preset resolution. Based on the requirements of autonomous driving perception tasks, the grid resolution is typically set between 0.1 meters and 0.5 meters, preferably 0.2 meters. That is, the space of the superimposed complete static and dynamic point clouds is divided into cubic voxels with a side length of 0.2 meters. The semantic information of the points within each voxel is counted and encoded into an occupancy status (occupied / free) and a main semantic category, thereby generating accurate 3D occupancy semantic grid ground truth.
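The voxelization step can be sketched as a per-voxel majority vote. This is an illustration under assumptions: semantic labels are non-negative integers below `num_classes`, the main semantic category of a voxel is the label with the most points, free voxels receive the hypothetical label -1, and the grid origin and shape are arbitrary example values.

```python
import numpy as np

FREE = -1  # hypothetical label for unoccupied voxels

def voxelize_semantic(points, labels, voxel_size=0.2, origin=(0.0, 0.0, 0.0),
                      shape=(5, 5, 5), num_classes=4):
    """Return (occupied mask, semantic grid) from labeled points using a majority vote per voxel."""
    origin = np.asarray(origin, dtype=float)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(shape)), axis=1)
    votes = np.zeros(tuple(shape) + (num_classes,), dtype=int)
    i, j, k = idx[inside].T
    np.add.at(votes, (i, j, k, labels[inside]), 1)  # count points per voxel and class
    occupied = votes.sum(axis=-1) > 0
    semantics = np.where(occupied, votes.argmax(axis=-1), FREE)
    return occupied, semantics
```

The static and dynamic point clouds are simply concatenated before the call, so both contribute votes to the same grid.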

[0061] Example 2: Referring to Figure 2, which is a schematic diagram of the hardware device in operation according to an embodiment of the present invention, the hardware device specifically includes: a scene dynamic and static occupancy semantic construction and completion device 401 based on multimodal data, a processor 402, and a storage device 403.

[0062] A scene dynamic and static occupancy semantic construction and completion device 401 based on multimodal data: The scene dynamic and static occupancy semantic construction and completion device 401 based on multimodal data implements the scene dynamic and static occupancy semantic construction and completion method based on multimodal data.

[0063] Processor 402: The processor 402 loads and executes the instructions and data in the storage device 403 to implement the scene dynamic and static occupancy semantic construction and completion method based on multimodal data.

[0064] Storage device 403: The storage device 403 stores instructions and data; the storage device 403 is used to implement the scene dynamic and static occupancy semantic construction and completion method based on multimodal data.

[0065] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for constructing and completing scene dynamic and static occupancy semantics based on multimodal data, characterized in that: The method includes the following steps: S1. Perform scene modeling based on dense static scenes to obtain complete static point cloud semantic labels; S2. For point cloud data that is not static, perform dynamic obstacle point cloud completion to restore the complete obstacle point cloud; S3. Overlay the complete static point cloud with the complete point cloud of obstacles to generate accurate occupancy semantic raster ground truth.

2. The method for constructing and completing scene dynamic and static occupancy semantics based on multimodal data as described in claim 1, characterized in that: Step S1 is as follows: S11. Use the single-frame point cloud after precise positioning to recall the corresponding dense static scene map. S12. After change detection and dynamic element removal, the single-frame point cloud is fused with the dense static scene map to obtain a complete semantic label for the single-frame static point cloud. S13. Repeat steps S11 to S12 for other frame point clouds to obtain complete static point cloud semantic labels.

3. The method for constructing and completing scene dynamic and static occupancy semantics based on multimodal data as described in claim 2, characterized in that: In step S12, a point cloud segmentation model is used to segment the local point cloud in the dense static scene map and automatically obtain the static scene semantic labels.

4. The method for constructing and completing scene dynamic and static occupancy semantics based on multimodal data as described in claim 1, characterized in that: Step S2 is as follows: S21. Construct a vehicle template library; S22. A diffusion model is used to complete the incomplete point cloud data of vehicles that are not static, resulting in a completed point cloud. The diffusion model integrates PointMAE and LION, and achieves high-precision completion of incomplete point clouds through a two-stage structure of local completion and latent space diffusion reconstruction. S23. Match the completed point cloud with the vehicle template library, and use the matching result as the complete point cloud of the vehicle. S24. Adjust the complete point cloud of the vehicle according to its size and orientation.

5. The method for constructing and completing scene dynamic and static occupancy semantics based on multimodal data as described in claim 4, characterized in that: The process of constructing the vehicle template library in step S21 is as follows: It employs a 360° scan in advance, scanning vehicles of different shapes and sizes, including cars, SUVs, pickup trucks, trucks, and buses.

6. The method for constructing and completing scene dynamic and static occupancy semantics based on multimodal data as described in claim 1, characterized in that: The adjustment described in step S24 is as follows: Place the vehicle at the corresponding position in the local point cloud map according to the center point position of the vehicle.

7. The method for constructing and completing scene dynamic and static occupancy semantics based on multimodal data as described in claim 1, characterized in that: Step S3 is as follows: The complete static point cloud is overlaid with the complete point cloud of the obstacle, aligned and voxelized to generate a raster map with semantics. Visual analysis is performed on the raster map to obtain the final accurate scene dynamic and static occupancy semantic raster ground truth labeling results.

8. A storage device, characterized in that: The storage device stores instructions and data to implement the scene dynamic and static occupancy semantic construction and completion method based on multimodal data as described in any one of claims 1 to 7.

9. A scene dynamic and static occupancy semantic construction and completion device based on multimodal data, characterized in that it includes: a processor and a storage device; the processor loads and executes instructions and data in the storage device to implement the scene dynamic and static occupancy semantic construction and completion method based on multimodal data as described in any one of claims 1 to 7.