A kind of laser radar, inertial and vision fusion three-dimensional accurate positioning and semantic mapping method and system based on segmentation guidance

What is AI technical title?
AI technical title is built by PatSnap AI team. It summarizes the technical point description of the patent document.
By incorporating semantic segmentation masks and cross-frame semantic consistency constraints into extended Kalman filtering, the problem of insufficient localization robustness of LiDAR, inertial, and vision fusion SLAM systems in dynamic environments and ambiguous scenarios is solved, achieving high-precision semantic mapping and pose estimation, which is suitable for complex environment applications of autonomous robots and autonomous vehicles.

CN122265619APending Publication Date: 2026-06-23HARBIN INST OF TECH

View PDF 0 Cites 0 Cited by

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: HARBIN INST OF TECH
Filing Date: 2026-04-20
Publication Date: 2026-06-23

Application Information

Patent Timeline

20 Apr 2026

Application

23 Jun 2026

Publication

CN122265619A

IPC: G06V10/24; G06V10/26; G06V10/40; G06V10/762; G06V10/80; G06V10/82; G06T17/05

AI Tagging

Application Domain

Character and pattern recognition 3D modelling

Technology Topics

Visual perception Lidar

Explore More Agents

Novelty Search
Search existing technologies and assess novelty
↗
FTO
Analyze whether a product may infringe others' patents
↗
Design FTO
Check prior-design risk for exterior design
↗
Drafting
Draft patent application text based on a technical solution
↗
Find Solutions with TRIZ
Generate feasible solution to solve your technical challenge
↗

Get free access to AI patent search and analysis

Check patentability, review prior art and ask IP Agent with full patent context.

Smart Images

Figure CN122265619A_ABST

Patent Text Reader

Abstract

The application relates to a kind of laser radar, inertial and vision fusion three-dimensional accurate positioning and semantic mapping method and system based on segmentation guidance, and relates to the field of robot autonomous positioning and environment modeling.The defects that the existing laser radar, inertial, vision fusion SLAM system is insufficient in positioning robustness in dynamic environment, ambiguous scene, and loose in semantic information utilization are solved, a kind of laser radar, inertial, vision fusion three-dimensional accurate positioning and semantic mapping method based on segmentation guidance is provided, semantic segmentation mask depth is deeply integrated into the core process of extended Kalman filter state estimation, through semantic weighted feature selection and cross-frame semantic consistency constraint, high-precision, high-robustness three-dimensional positioning and semantic mapping in complex scene are realized.The application is also applicable to real-time pose estimation and three-dimensional semantic map construction in complex dynamic environment for autonomous robot, unmanned aerial vehicle, automatic driving vehicle and other application fields.

Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention relates to the field of robot autonomous localization and environmental modeling technology, specifically to a method and system for precise 3D localization and semantic mapping based on segmentation-guided LiDAR, inertial and visual fusion. Background Technology

[0002] Simultaneous Localization and Mapping (SLAM) is a core technology for autonomous robots and intelligent systems to achieve autonomous movement in unknown and complex environments. In recent years, multi-sensor fusion SLAM based on LiDAR, vision, and inertial measurement units has become the mainstream direction for improving positioning accuracy and robustness due to its ability to integrate the complementary advantages of different modal sensors.

[0003] Existing lidar, inertial, and visual odometry systems mostly rely on geometric constraints, photometric consistency, or low-level feature correlations for state estimation, which has significant shortcomings in practical applications. 1. Dynamic objects and cluttered environment interference: Dynamic targets, temporary occlusions and cluttered objects in the scene can disrupt the stability of data association, leading to pose estimation drift or even tracking failure; 2. Perceptual confusion and environmental ambiguity: In environments with repetitive structures, weak textures, or simple geometric features, relying solely on geometric matching can easily lead to mismatches, resulting in perceptual failure and localization errors. 3. Insufficient utilization of semantic information: Existing semantic SLAM usually uses semantic segmentation results as post-processing or auxiliary annotation methods, without deeply integrating semantic constraints into the core process of state estimation. The potential of semantic information to improve localization robustness, temporal consistency and global map correction has not been fully explored.

[0004] Although existing technologies have attempted to combine semantic segmentation with SLAM, they generally employ a loosely coupled approach. Semantic information is not involved in core state estimation processes such as extended Kalman filtering, making it impossible to correct geometric errors through semantic consistency constraints. Furthermore, traditional feature point selection relies solely on geometric smoothness, without incorporating semantic category reliability for adaptive weighting, making it difficult to suppress interference from dynamic and low-reliability features. Summary of the Invention

[0005] To overcome the shortcomings of existing SLAM systems that fuse LiDAR, inertial, and vision in dynamic environments and ambiguous scenarios, such as insufficient robustness in localization and loose utilization of semantic information, this invention provides a segmentation-guided LiDAR, inertial, and vision fusion-based method for precise 3D localization and semantic mapping. This method deeply integrates semantic segmentation masks into the core process of extended Kalman filter state estimation. Through semantically weighted feature selection and cross-frame semantic consistency constraints, it achieves high-precision and robust 3D localization and semantic mapping in complex scenarios. To this end, this invention proposes a segmentation-guided LiDAR, inertial, and vision fusion-based method and system for precise 3D localization and semantic mapping. The invention achieves this through the following technical solutions to solve the aforementioned technical problems: Option 1: This invention proposes a method for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial, and visual fusion. The method includes the following steps: Step 1: Acquire multi-source sensor synchronization data and generate semantic masks to obtain the category probability distribution of each pixel; Step 2: Based on the class probability distribution of each pixel and the angular velocity and linear acceleration measured by the IMU, construct a continuous-time error state dynamic model, obtain the predicted pose, velocity and deviation at the current moment through state propagation, and calculate the prediction error covariance matrix. Step 3: Project the 3D points of the LiDAR onto the visual image plane through the camera intrinsic parameter matrix and extrinsic parameter transformation matrix to obtain the corresponding pixel coordinates. Based on the pixel coordinates, establish the spatial correspondence between the 3D points and the 2D semantic mask. Design a semantic weighted feature point selection mechanism, and combine geometric smoothness and semantic category reliability to allocate differentiated sampling weights to distinguish edge features and planar feature candidate points. Step 4: For the 3D points of the LiDAR with geometric correspondence between consecutive frames, calculate the KL divergence of their semantic distribution as the semantic consistency error; construct a robust semantic constraint factor based on the Huber kernel; Step 5: Stack the geometric observation residuals, semantic observation residuals, and cross-frame semantic consistency residuals of the lidar together to construct the joint observation vector of the extended Kalman filter; through the block sequential update strategy, complete the tight coupling fusion of geometric and semantic information and output the optimal pose estimate. Step 6: Use a global voxel grid structure to construct a semantically enhanced 3D map, and use an exponential moving average strategy to update the voxel centers and semantic distribution to generate a spatiotemporally stable and semantically rich 3D voxel map.

[0006] Furthermore, a preferred embodiment is provided, wherein the method for obtaining the category probability distribution of each pixel in step 1 is as follows:

[0007] Each type of probability satisfies a normalization constraint:

[0008] Based on the above probability distribution, a semantic mask corresponding one-to-one with the image is generated. This completes the preprocessing of multi-source data and the extraction of semantic priors.

[0009] Furthermore, a preferred embodiment is provided, wherein the method for constructing a continuous-time error state dynamic model in step 2 based on the category probability distribution of each pixel and the angular velocity and linear acceleration measured by the IMU is as follows: Angular velocity based on IMU measurement With linear acceleration Establish a transfer model between real physical quantities and measured values:

[0010]

[0011] In the formula, The rotation matrix corresponding to the quaternion. The vector of gravitational acceleration. , Zero-mean Gaussian measurement noise; The continuous-time error state dynamics satisfy a linear differential equation:

[0012] matrix The format is: .

[0013] Furthermore, a preferred embodiment is provided, wherein the method for obtaining the predicted pose, velocity, and deviation at the current moment through state propagation in step 2, and calculating the prediction error covariance matrix is as follows: Calculate the discrete-time state transition matrix using matrix exponentiation. :

[0014] System state prediction based on the transition matrix:

[0015] Simultaneous calculation of the prediction error covariance matrix:

[0016] In the formula, The output is the IMU predicted pose, velocity, zero bias, and corresponding error covariance, which is the noise covariance of the discretization process.

[0017] Furthermore, a preferred embodiment is provided, in which step 3 further includes a step of using a dual-threshold strategy to filter feature points, that is, when the weight is greater than the high threshold, it is considered an edge feature, and when it is less than the low threshold, it is considered a planar feature, and intermediate points are discarded. Finally, the filtered feature points are used to construct geometric residuals and semantic residuals. .

[0018] Furthermore, a preferred embodiment is provided, wherein the method for constructing robust semantic constraint factors based on the Huber kernel in step 4 is as follows:

[0019] In the formula, The semantic consistency threshold is used to derive the overall semantic consistency cost function:

[0020] Linearizing the semantic constraints yields the corresponding Jacobian matrix. With residuals This is then incorporated into the update process of the extended Kalman filter.

[0021] Furthermore, a preferred embodiment is provided, wherein the method for updating voxel centers and semantic distribution using an exponential moving average strategy in step 6 to generate a spatiotemporally stable and semantically rich 3D voxel map is as follows:

[0022] In the formula, It is a forgetting factor.

[0023] Option 2: A 3D precise positioning and semantic mapping system based on segmentation-guided lidar, inertial and visual fusion, the system comprising: The data preprocessing module is used to acquire synchronized data from multiple sensors and generate semantic masks to obtain the category probability distribution of each pixel. The IMU state propagation module is used to construct a continuous-time error state dynamics model based on the class probability distribution of each pixel and the angular velocity and linear acceleration measured by the IMU. It obtains the predicted pose, velocity and deviation at the current moment through state propagation and calculates the prediction error covariance matrix. The semantic feature selection module projects the 3D points of the LiDAR onto the visual image plane through the camera intrinsic parameter matrix and extrinsic parameter transformation matrix to obtain the corresponding pixel coordinates. Based on the pixel coordinates, a spatial correspondence between the 3D points and the 2D semantic mask is established. A semantic weighted feature point selection mechanism is designed, which combines geometric smoothness and semantic category reliability to allocate differentiated sampling weights and distinguishes edge feature and planar feature candidate points. A semantic constraint construction module is used to calculate the KL divergence of the semantic distribution of 3D LiDAR points with geometric correspondence between consecutive frames as the semantic consistency error; a robust semantic constraint factor is constructed based on the Huber kernel. The joint state estimation module is used to jointly stack the geometric observation residuals, semantic observation residuals, and cross-frame semantic consistency residuals of the lidar to construct the joint observation vector of the extended Kalman filter; through the block sequential update strategy, the tight coupling fusion of geometric and semantic information is completed to output the optimal pose estimate. The semantic map construction module is used to construct semantically enhanced 3D maps using a global voxel grid structure. It uses an exponential moving average strategy to update voxel centers and semantic distribution, generating a spatiotemporally stable and semantically rich 3D voxel map.

[0024] Option 3: A computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method described in Option 1.

[0025] Option 4: A computer device, including a memory and a processor, wherein the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the method described in Option 1.

[0026] The advantages of this invention are: The present invention provides a method and system for precise 3D positioning and semantic mapping based on segmentation-guided lidar, inertial and visual fusion, which significantly enhances the anti-dynamic interference capability. Through semantic weighted feature selection, it effectively suppresses the damage to data association caused by dynamic objects and cluttered backgrounds, resulting in more stable pose estimation and less likelihood of tracking failure.

[0027] The present invention provides a method and system for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial and visual fusion, which significantly improves localization accuracy: the average localization error (ATE) is as low as 0.0290m on the NTU-VGA public dataset, which is about 9% better than the current best method FAST-LIVO2. It can also avoid perceptual confusion and mismatch in weak texture and repetitive structure scenes.

[0028] The present invention describes a method and system for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial and visual fusion, and deep utilization of semantic information: semantic masking is tightly coupled into EKF state estimation, rather than just as a post-processing aid, which significantly improves temporal consistency and global map correction capabilities.

[0029] The present invention provides a method and system for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial and visual fusion, which outputs richer information: it can directly generate a 3D voxel map with semantic annotation, providing high-quality basic data for subsequent scene understanding, target tracking and autonomous decision-making.

[0030] This invention is also applicable to applications such as real-time pose estimation and 3D semantic map construction for autonomous robots, drones, and self-driving vehicles in complex dynamic environments. Attached Figure Description

[0031] Fig. 1 This is a schematic diagram of the overall framework of the segmentation-guided lidar, inertial and visual fusion-based 3D precise positioning and semantic mapping method described in Implementation Method 1.

[0032] Fig. 2 This is a schematic diagram comparing the semantic enhancement 3D reconstruction effects described in Implementation Method 1.

[0033] Fig. 3 This diagram illustrates a comparison of the positioning accuracy of the method described in this invention with that of existing technologies under the same conditions. Detailed Implementation

[0034] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them.

[0035] Implementation Method 1, see [link] Figs. 1 to 3 This embodiment describes a method for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial, and visual fusion. The method specifically includes the following steps: S1: Acquire synchronized data from multiple sensor sources and generate a semantic mask. In S1, to acquire synchronized data from multiple sensors, you can choose LiDAR, a six-axis attitude sensor (IMU), or an RGB vision camera as environmental perception and state acquisition devices. First, perform extrinsic parameter calibration and timestamp alignment on the LiDAR, IMU, and camera to obtain strictly spatiotemporally synchronized LiDAR point cloud data. IMU measurement data , and visual image data Using the SAM3 universal segmentation model to segment visual images Perform pixel-by-pixel semantic segmentation for the image domain. Each pixel is allocated Class semantic probability distribution The probability vector satisfies:

[0036] Each type of probability satisfies a normalization constraint:

[0037] Based on the above probability distribution, a semantic mask corresponding one-to-one with the image is generated. This completes the preprocessing of multi-source data and the extraction of semantic priors.

[0038] S2: Attitude Sensor State Prediction and Error Covariance Calculation In S2, the system state space is established using the attitude sensor (IMU) as the body reference, and the system state vector is defined. for:

[0039] Angular velocity based on IMU measurement With linear acceleration Establish a transfer model between real physical quantities and measured values:

[0040]

[0041] In the formula, The rotation matrix corresponding to the quaternion. The vector of gravitational acceleration. , Zero-mean Gaussian measurement noise.

[0042] The continuous-time error state dynamics satisfy a linear differential equation:

[0043] System Matrix The format is as follows:

[0044] Calculate the discrete-time state transition matrix using matrix exponentiation. :

[0045] System state prediction based on the transition matrix:

[0046] Simultaneous calculation of the prediction error covariance matrix:

[0047] In the formula, This is used to discretize the noise covariance during the process. The final output includes the IMU's predicted pose, velocity, zero bias, and corresponding error covariance, providing prior constraints for subsequent EKF joint state estimation.

[0048] S3: Semantic Projection of Coordinate Points and Feature Selection In S3, the three-dimensional points of the LiDAR will be... The corresponding pixel coordinates are obtained by projecting the camera intrinsic parameter matrix K and the extrinsic parameter transformation matrix onto the image plane:

[0049] Based on this, a spatial correspondence between three-dimensional geometric points and two-dimensional semantic masks is established.

[0050] For the projected pixels Take its semantic probability distribution And through the category confidence matrix The semantic observation vector is obtained by weighting:

[0051] In the formula, This indicates element-wise multiplication.

[0052] For each point in the lidar point cloud Calculate the local smoothness index:

[0053] Normalized smoothness ratio is And further calculate the normalized smoothness score:

[0054] Construct semantically weighted sampling weights:

[0055] In the formula, The dominant semantic category for this point. As the basic weight of the category, This is the sensitivity coefficient.

[0056] A dual-threshold strategy is used to filter feature points: those with a weight greater than the higher threshold are considered edge features, while those with a weight less than the lower threshold are considered planar features, and intermediate points are discarded. Finally, the filtered feature points are used to construct geometric and semantic residuals.

[0057] This provides semantic constraints for subsequent EKF joint state estimation.

[0058] S4: Construction of Cross-Frame Semantic Consistency Constraints In S4, to ensure the temporal consistency of semantic tags between consecutive frames, cross-frame semantic consistency constraints are constructed for the 3D spatial points corresponding to geometric matching. Let the same 3D point be in the... Frame and the The projected pixel coordinates of the frame are respectively and The corresponding semantic probability distribution is and The semantic consistency error is defined as the KL divergence between the two:

[0059] A robust weight function is constructed based on the Huber kernel function to apply weighted constraints to semantic consistency errors:

[0060] In the formula, This represents the semantic consistency threshold. From this, we obtain the overall semantic consistency cost function:

[0061] Linearizing the semantic constraints described above yields the corresponding Jacobian matrix. With residuals This is then incorporated into the update process of the extended Kalman filter to enhance the temporal stability and semantic consistency of state estimation.

[0062] S5: Joint State Estimation under the EKF Framework In S5, within the error state extended Kalman filter framework, the system's joint state estimation is achieved by fusing lidar geometric observations, semantic observations, and cross-frame semantic consistency constraints. Let the posterior state estimate of the previous time step be... The corresponding error covariance is Then the predicted state and the predicted covariance at the current moment are:

[0063] The geometric and semantic residuals of a batch of feature points are stacked to form a joint observation residual vector. :

[0064] in, For point-to-plane / point-to-line geometric residuals, Let be the semantic observation residual. The corresponding joint Jacobian matrix and observation noise covariance are:

[0065] Calculate the Kalman gain:

[0066] Execution status update and covariance update:

[0067] A block-sequential update strategy is adopted to handle geometric and semantic heterogeneous observations, solve the dimension mismatch problem, and finally output the optimal pose, velocity and zero-bias estimate of the system, realizing the tight coupling and fusion of geometric and semantic information.

[0068] S6: Semantic Enhancement 3D Voxel Map Construction In S6, a global voxel grid structure is used to construct semantically enhanced 3D maps. The voxel map mapping relationship is represented as follows:

[0069] Each voxel unit Used to store semantically labeled 3D point information. Map points are stored in a semantically enhanced form, including location, normal vector, semantic confidence, and observation timestamp, in the following format:

[0070] In the formula, For point cloud locations, For the surface normal vector, This is a semantic confidence vector. This is the observation timestamp.

[0071] Based on voxel resolution, the lidar points are... Mapped to the corresponding voxel unit:

[0072] An exponential moving average strategy is used to update voxel centers and semantic distribution online, achieving dynamic map optimization and maintaining spatiotemporal consistency.

[0073] In the formula, This is a forgetting factor used to balance the temporal stability and spatial adaptability of a map.

[0074] The final output is a 3D voxel map containing multi-dimensional information including location, geometry, and semantics. Fig. 2 Column 5 shows exemplary results of semantic map construction, providing unified data support for autonomous robot localization, scene understanding, and path planning.

[0075] See Fig. 3This embodiment describes the effectiveness and advancement of the proposed segmentation-guided LiDAR, inertial, and visual fusion-based 3D localization and semantic mapping method. Quantitative testing and comparative experiments were conducted using the NTU-VGA public dataset. The method was compared with mainstream SLAM methods such as A-LOAM, SDV-LOAM, VINS-Mono, VINS-Fusion, LIO-SAM, and FAST-LIVO2 under the same environment, using average trajectory error (ATE) as the core evaluation metric. The experimental results are as follows: Fig. 3 As shown, the optimal method for each of the 9 test sequences is highlighted in dark green, and the suboptimal method is highlighted in light green.

[0076] from Fig. 3 As can be seen, the method of the present invention achieves the best positioning accuracy on multiple test sequences, with an average positioning error (ATE) as low as 0.0290m, which is about 9% higher than the current best method FAST-LIVO2. It achieves the best results in 8 out of 9 test sequences and the second best results in 1 sequence. The overall positioning performance is significantly better than the existing laser, inertial and vision fusion SLAM methods.

[0077] Meanwhile, this invention can effectively suppress interference from dynamic objects and mismatches in ambiguous scenarios through semantic masking guidance, semantic weighted feature selection, and cross-frame semantic consistency constraints. The pose estimation is more stable, and it can generate three-dimensional voxel maps with rich semantic information and strong spatiotemporal consistency, which can meet the real-time positioning and mapping needs of autonomous robots, drones, and autonomous vehicles in complex dynamic environments.

[0078] Implementation Method 2: This implementation method proposes a segmentation-guided lidar, inertial, and visual fusion-based 3D precise positioning and semantic mapping system, the system comprising: The data preprocessing module is configured to perform time synchronization and calibration of LiDAR, IMU, and visual images, and generate visual image semantic masks through the SAM3 model. The IMU state propagation module is configured to perform state prediction based on IMU measurement data and output the system prior pose and error covariance. The semantic feature selection module is configured to realize the projection association from LiDAR points to semantic masks and to filter stable geometric features through a semantic weighting mechanism. The semantic constraint building module is configured to calculate cross-frame semantic consistency error and build robust semantic constraint factors. The joint state estimation module is configured to fuse geometric and semantic observations within the EKF framework to achieve optimal system state updates. The semantic map building module is configured to maintain a semantically enhanced 3D voxel map, enabling the integrated storage and updating of semantic and geometric information.

[0079] In summary, this implementation upgrades semantic segmentation masks from a post-processing auxiliary means to a core constraint for state estimation. Through mask-guided feature weighting and cross-frame semantic consistency, it achieves tight coupling and fusion of geometry and semantics, significantly improving the accuracy and robustness of localization and mapping in dynamic environments, repetitive structures, and weakly textured scenes, while maintaining real-time computational efficiency.

[0080] Those skilled in the art will understand that the above description is merely a preferred embodiment of the present invention, and the features described in the various embodiments and / or claims of this disclosure can be combined or combined in various ways, even if such combinations or combinations are not explicitly described in this disclosure. This is not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

[0081] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention. Clearly, those skilled in the art can make various alterations and modifications to the invention without departing from its spirit and scope. Thus, if these modifications and modifications of the invention fall within the scope of the claims and their equivalents, the invention is also intended to include these modifications and modifications.

Claims

1. A method for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial, and visual fusion, characterized in that, The method includes the following steps: Step 1: Acquire multi-source sensor synchronization data and generate semantic masks to obtain the category probability distribution of each pixel; Step 2: Based on the class probability distribution of each pixel and the angular velocity and linear acceleration measured by the IMU, construct a continuous-time error state dynamic model, obtain the predicted pose, velocity and deviation at the current moment through state propagation, and calculate the prediction error covariance matrix. Step 3: Project the 3D points of the LiDAR onto the visual image plane through the camera intrinsic parameter matrix and extrinsic parameter transformation matrix to obtain the corresponding pixel coordinates. Based on the pixel coordinates, establish the spatial correspondence between the 3D points and the 2D semantic mask. Design a semantic weighted feature point selection mechanism, and combine geometric smoothness and semantic category reliability to allocate differentiated sampling weights to distinguish edge features and planar feature candidate points. Step 4: For the 3D points of the LiDAR with geometric correspondence between consecutive frames, calculate the KL divergence of their semantic distribution as the semantic consistency error; construct a robust semantic constraint factor based on the Huber kernel; Step 5: Stack the geometric observation residuals, semantic observation residuals, and cross-frame semantic consistency residuals of the lidar together to construct the joint observation vector of the extended Kalman filter; through the block sequential update strategy, complete the tight coupling fusion of geometric and semantic information and output the optimal pose estimate. Step 6: Use a global voxel grid structure to construct a semantically enhanced 3D map, and use an exponential moving average strategy to update the voxel centers and semantic distribution to generate a spatiotemporally stable and semantically rich 3D voxel map.

2. The method for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial, and visual fusion according to claim 1, characterized in that, The method for obtaining the class probability distribution of each pixel in step 1 is as follows: Each type of probability satisfies a normalization constraint: Based on the above probability distribution, a semantic mask corresponding one-to-one with the image is generated. This completes the preprocessing of multi-source data and the extraction of semantic priors.

3. The method for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial, and visual fusion according to claim 1, characterized in that, The method for constructing a continuous-time error state dynamics model based on the class probability distribution of each pixel and the angular velocity and linear acceleration measured by the IMU in step 2 is as follows: Angular velocity based on IMU measurement With linear acceleration Establish a transfer model between real physical quantities and measured values: In the formula, The rotation matrix corresponding to the quaternion. The vector of gravitational acceleration. , Zero-mean Gaussian measurement noise; The continuous-time error state dynamics satisfy a linear differential equation: matrix The format is: 。 4. The method for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial, and visual fusion according to claim 1, characterized in that, In step 2, the method for obtaining the predicted pose, velocity, and deviation at the current moment through state propagation and calculating the prediction error covariance matrix is as follows: Calculate the discrete-time state transition matrix using matrix exponentiation. : System state prediction based on the transition matrix: Simultaneous calculation of the prediction error covariance matrix: In the formula, The output is the IMU predicted pose, velocity, zero bias, and corresponding error covariance, which is the noise covariance of the discretization process.

5. The method for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial, and visual fusion according to claim 1, characterized in that, Step 3 also includes a step of using a dual-threshold strategy to filter feature points. That is, features with a weight greater than the high threshold are considered edge features, while those with a weight less than the low threshold are considered planar features. Intermediate points are discarded. Finally, the filtered feature points are used to construct geometric and semantic residuals. 。 6. The method for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial, and visual fusion according to claim 1, characterized in that, The method for constructing robust semantic constraint factors based on the Huber kernel in step 4 is as follows: In the formula, The semantic consistency threshold is used to derive the overall semantic consistency cost function: Linearizing the semantic constraints yields the corresponding Jacobian matrix. With residuals This is then incorporated into the update process of the extended Kalman filter.

7. The method for precise 3D localization and semantic mapping based on segmentation-guided lidar, inertial, and visual fusion according to claim 1, characterized in that, Step 6 employs an exponential moving average strategy to update voxel centers and semantic distribution, generating a spatiotemporally stable and semantically rich 3D voxel map. In the formula, It is a forgetting factor.

8. A three-dimensional precise positioning and semantic mapping system based on segmentation-guided lidar, inertial and visual fusion, characterized in that, The system includes: The data preprocessing module is used to acquire synchronized data from multiple sensors and generate semantic masks to obtain the category probability distribution of each pixel. The IMU state propagation module is used to construct a continuous-time error state dynamics model based on the class probability distribution of each pixel and the angular velocity and linear acceleration measured by the IMU. It obtains the predicted pose, velocity and deviation at the current moment through state propagation and calculates the prediction error covariance matrix. The semantic feature selection module projects the 3D points of the LiDAR onto the visual image plane through the camera intrinsic parameter matrix and extrinsic parameter transformation matrix to obtain the corresponding pixel coordinates. Based on the pixel coordinates, a spatial correspondence between the 3D points and the 2D semantic mask is established. A semantic weighted feature point selection mechanism is designed, which combines geometric smoothness and semantic category reliability to allocate differentiated sampling weights and distinguishes edge feature and planar feature candidate points. A semantic constraint construction module is used to calculate the KL divergence of the semantic distribution of 3D LiDAR points with geometric correspondence between consecutive frames as the semantic consistency error; a robust semantic constraint factor is constructed based on the Huber kernel. The joint state estimation module is used to jointly stack the geometric observation residuals, semantic observation residuals, and cross-frame semantic consistency residuals of the lidar to construct the joint observation vector of the extended Kalman filter; through the block sequential update strategy, the tight coupling fusion of geometric and semantic information is completed to output the optimal pose estimate. The semantic map construction module is used to construct semantically enhanced 3D maps using a global voxel grid structure. It uses an exponential moving average strategy to update voxel centers and semantic distribution, generating a spatiotemporally stable and semantically rich 3D voxel map.

9. A computer storage medium storing a computer program, characterized in that, When the computer program is executed by a processor, it implements the method described in any one of claims 1-7.

10. A computer device, characterized in that, include: A memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the program to implement the method of any one of claims 1-7.