A multi-frame point cloud pre-segmentation method and system based on a large semantic segmentation model
By using a multi-frame point cloud pre-segmentation method based on a large semantic segmentation model, and utilizing multi-sensor data and inertial navigation units for point cloud registration and voxel fusion, the problem of high annotation cost and error accumulation in automatic semantic segmentation of 3D point clouds is solved, and a high-quality static environmental semantic map is generated.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 广州祺宸科技有限公司
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-19
Figure CN122244070A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of point cloud processing technology, specifically relating to a multi-frame point cloud pre-segmentation method and system based on a large semantic segmentation model.
Background Technology
[0002] Currently, the technical approaches to automatic semantic segmentation and annotation of 3D point clouds mainly fall into two categories. The first category is segmentation methods based on 3D deep learning networks, such as PointNet++ and RandLA-Net. The second category is annotation methods based on 2D-3D projection, with PointPainting and its variants being typical examples. These methods use mature 2D image segmentation techniques (semantic segmentation networks or large vision models such as SAM) to segment the image, and then project the 2D segmentation results back onto the 3D point cloud.
[0003] However, the above-mentioned schemes generally adopt an iterative closest point (ICP) strategy between adjacent frames when registering long-sequence point clouds. The registration error accumulates as the sequence grows, distorting the final global map. Most methods also fail to distinguish between dynamic and static objects, so moving objects leave ghosting in the fused map and pollute the structure of the static scene.
Summary of the Invention
[0004] The technical problem to be solved by this invention is to overcome the current problems of high cost of semantic annotation of 3D point clouds, incomplete annotation due to occlusion in single-frame projection, accumulation of registration error over long sequences, and ghosting of dynamic objects in fused maps.
[0005] To address the aforementioned technical problems, a first aspect of this invention discloses a multi-frame point cloud pre-segmentation method based on a large semantic segmentation model, the method comprising:
[0006] Collect data from multiple sensors, including at least multiple frames of lidar point clouds, camera images synchronized with each frame of point cloud in time, and pose data provided by the inertial navigation unit.
[0007] For each frame of the camera image, 2D semantic segmentation is performed using a large semantic segmentation model that supports open vocabulary, and the 2D semantic labels obtained from the segmentation are projected onto the corresponding single-frame point cloud, and each point is assigned an initial semantic label and confidence level.
[0008] Guided by the pose data of the inertial navigation unit, the current frame point cloud is registered with the reference point cloud constructed by accumulating point clouds from multiple historical frames to obtain the pose of each frame in the global coordinate system.
[0009] Align all registered point clouds to the global coordinate system and perform voxelization. Within each voxel, fuse and vote on the semantic information of points from different frames to determine the final semantics of each voxel.
[0010] The final semantic result obtained from the fusion voting is back-projected onto each original single-frame point cloud.
[0011] As an optional implementation, in the first aspect of the present invention, the acquisition of multi-sensor data includes:
[0012] For a single-frame point cloud set acquired by multiple lidar sensors, the three-dimensional coordinates of each point are converted to homogeneous coordinates;
[0013] Using the calibrated extrinsic parameters from each lidar to the carrier body, the homogeneous point cloud is transformed into the carrier coordinate system;
[0014] The transformed point clouds from all lidars are merged to form a single-frame fused point cloud.
[0015] As an optional implementation, in the first aspect of the present invention, the step of performing 2D semantic segmentation on each frame of the camera image using a large semantic segmentation model that supports open vocabulary, and projecting the segmented 2D semantic labels onto the corresponding single-frame point cloud, and assigning an initial semantic label and confidence level to each point includes:
[0016] Using the SAM3 model, multi-class segmentation is performed on each camera image with text prompts to generate 2D semantic label maps and prediction confidence maps;
[0017] The semantic labels and confidence scores of the images, combined with camera intrinsic and extrinsic parameters, are projected onto the corresponding single-frame point cloud, and each point is assigned an initial semantic label, confidence score, and static prior.
[0018] As an optional implementation, in the first aspect of the present invention, the guidance based on the pose data of the inertial navigation unit, which involves registering the current frame point cloud with a reference point cloud constructed from accumulated point clouds of multiple historical frames, includes:
[0019] Using the pose provided by the inertial navigation unit, calculate the initial transformation matrix of the current frame relative to the starting frame, as the initial value for registration;
[0020] Determine whether the current frame is a keyframe. The criteria for determining a keyframe include the displacement increment, rotation increment, or the point cloud overlap rate with the previous keyframe exceeding a preset threshold.
[0021] A cumulative reference point cloud is constructed by transforming the point clouds of all keyframes up to the current frame into the global coordinate system using their current optimal poses, and then performing voxel downsampling and stitching.
[0022] Using the initial transformation matrix as the initial value, the generalized iterative closest point (GICP) algorithm is used to register the current frame point cloud with the accumulated reference point cloud to obtain the optimized pose.
[0023] As an optional implementation, in the first aspect of the present invention, aligning all registered point clouds to the global coordinate system and performing voxelization, and fusing and voting on the semantic information of points from different frames within each voxel includes:
[0024] The registered multi-frame point cloud is stitched together and then divided into voxel spaces according to a preset resolution.
[0025] Within each voxel, static and dynamic points are distinguished based on semantic priors;
[0026] For static points, the confidence scores are summed per semantic category, and the category with the highest summed confidence is taken as the final semantic label of the voxel;
[0027] For the dynamic points, retain their best semantic labels from a single frame observation, or perform independent compensation and fusion based on their motion trajectory.
[0028] As an optional implementation, in the first aspect of the invention, the step of back-projecting the final semantic result obtained from the fusion voting back to each original single-frame point cloud includes:
[0029] Based on the recorded frame index, the semantic information after voxel fusion is back-allocated to the corresponding points of each original frame;
[0030] For points that still have no semantic labels after assignment, semantic labels are propagated from their neighboring labeled points within the single-frame point cloud using the K-nearest neighbor algorithm.
[0031] As an optional implementation, in the first aspect of the present invention, the method further includes:
[0032] Output a single-frame point cloud sequence with the final semantic label, a complete global semantic point cloud map, and a separated dynamic target point cloud set.
[0033] As an optional implementation, in the first aspect of the invention, before the step of registering the current frame point cloud with a reference point cloud constructed from accumulated point clouds of historical multiple frames, the method further includes:
[0034] Based on the static prior in the initial semantic labels, points identified as dynamic objects are filtered out from the current frame point cloud; or,
[0035] During the registration process, points whose registration residuals consistently exceed a threshold are marked and temporarily removed as dynamic points.
[0036] The second aspect of this invention discloses a multi-frame point cloud pre-segmentation system based on a large semantic segmentation model, used to implement the multi-frame point cloud pre-segmentation method based on a large semantic segmentation model described in any of the above embodiments, the system comprising:
[0037] The data acquisition module is used to acquire multi-sensor data, which includes at least multiple frames of lidar point cloud, camera images synchronized with each frame of point cloud, and pose data provided by the inertial navigation unit.
[0038] The semantic segmentation projection module is used to perform 2D semantic segmentation on each frame of the camera image using a large semantic segmentation model that supports open vocabulary, and to project the segmented 2D semantic labels onto the corresponding single-frame point cloud, assigning an initial semantic label and confidence score to each point;
[0039] The cumulative registration module is used to register the current frame point cloud with a reference point cloud constructed by accumulating point clouds from multiple historical frames, based on the pose data of the inertial navigation unit, so as to obtain the pose of each frame in the global coordinate system.
[0040] The voxel fusion module is used to align all registered point clouds to the global coordinate system and perform voxelization. Within each voxel, the semantic information of points from different frames is fused and voted to determine the final semantics of each voxel.
[0041] The back-projection module is used to back-project the final semantic result obtained by the voxel fusion module back to each original single-frame point cloud.
[0042] A third aspect of this invention discloses another multi-frame point cloud pre-segmentation system based on a large semantic segmentation model, the system comprising:
[0043] Memory containing executable program code;
[0044] A processor coupled to the memory;
[0045] The processor calls the executable program code stored in the memory to execute the multi-frame point cloud pre-segmentation method based on a large semantic segmentation model disclosed in the first aspect of the present invention.
[0046] The fourth aspect of this invention discloses a computer-readable storage medium storing computer instructions, which, when invoked by a processor, are used to execute a multi-frame point cloud pre-segmentation method based on a large semantic segmentation model disclosed in the first aspect of this invention.
[0047] Compared with the prior art, the beneficial effects of the present invention are:
[0048] By automatically generating image semantic labels using a large 2D visual model that supports open vocabulary and projecting them onto the point cloud, point cloud annotation is transformed from manual point-by-point annotation to automatic generation and verification based on the large 2D model, reducing the dependence on professional manpower and the cost of annotation. Through multi-frame cumulative fusion and a confidence voting mechanism within voxels, points that are occluded in a single frame can obtain labels from other observation frames, and consistent results are obtained through voting, improving the completeness of annotation. A cumulative generalized iterative closest point (GICP) registration strategy guided by the inertial navigation unit is adopted to align each new frame with the reference point cloud accumulated from multiple historical frames, rather than just with the previous frame, suppressing the frame-by-frame propagation of registration errors and improving the geometric accuracy of long-sequence point cloud stitching. In the fusion stage, dynamic points and static points are separated based on semantic priors and fused separately, avoiding the ghosting of dynamic objects in the static map and outputting a cleaner static environment semantic map. Through back-projection and the K-nearest neighbor propagation mechanism, semantic labels can be mapped back to the point cloud of each original frame while a globally consistent map is generated, providing a directly usable data interface for downstream tasks that require temporal information.
Attached Figure Description
[0049] The specific embodiments of the present invention will be further described in detail below with reference to the accompanying drawings, wherein:
[0050] Figure 1 is a flowchart illustrating a multi-frame point cloud pre-segmentation method based on a large semantic segmentation model disclosed in an embodiment of the present invention;
[0051] Figure 2 is a schematic diagram of the structure of a multi-frame point cloud pre-segmentation system based on a large semantic segmentation model disclosed in an embodiment of the present invention;
[0052] Figure 3 is a schematic diagram of another multi-frame point cloud pre-segmentation system based on a large semantic segmentation model disclosed in an embodiment of the present invention.
Detailed Implementation
[0053] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0054] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, apparatus, product, or device that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or devices.
[0055] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
[0056] This invention discloses a multi-frame point cloud pre-segmentation method and system based on a large semantic segmentation model. By using a large 2D visual model that supports open vocabulary to automatically generate image semantic labels and project them onto the point cloud, point cloud annotation is transformed from manual point-by-point annotation to automatic generation and verification based on the large 2D model, reducing the dependence on professional manpower and the cost of annotation. Through multi-frame cumulative fusion and a confidence voting mechanism within voxels, points that are occluded in a single frame can obtain labels from other observation frames, and consistent results are obtained through voting, improving the completeness of annotation. A cumulative generalized iterative closest point (GICP) registration strategy guided by an inertial navigation unit is adopted to align each new frame with the reference point cloud accumulated from multiple historical frames, rather than just with the previous frame, suppressing the frame-by-frame propagation of registration errors and improving the geometric accuracy of long-sequence point cloud stitching.
[0057] Example 1
[0058] Please refer to Figure 1, which is a flowchart illustrating a multi-frame point cloud pre-segmentation method based on a large semantic segmentation model disclosed in an embodiment of the present invention. The method described in Figure 1 is applied to a data processing chip, processing terminal, or processing server, and the processing server can be a local server or a cloud server; this embodiment of the invention does not limit the application. As shown in Figure 1, this multi-frame point cloud pre-segmentation method based on a large semantic segmentation model can include the following operations:
[0059] 101. Collect multi-sensor data, including at least multiple frames of LiDAR point clouds, camera images synchronized with each frame of point cloud, and pose data provided by the inertial navigation unit.
[0060] Specifically, this step ensures the spatiotemporal consistency of multi-source sensing data by simultaneously acquiring multiple frames of LiDAR point clouds, time-corresponding camera images, and pose data from the inertial navigation unit (INU). This synchronization mechanism provides a precise data foundation for the subsequent accurate projection of 2D image semantics onto the 3D point cloud and for initial registration guidance using INU poses.
[0061] As can be seen, this invention ensures the accurate alignment of multimodal information from the data source, providing a reliable and consistent input for the subsequent implementation of the complete technical process of "multi-frame point cloud semantic segmentation based on 2D large model", thereby supporting high-precision point cloud registration, effective cross-modal semantic association and the construction of the final global semantic map.
[0062] 102. For each frame of the camera image, perform 2D semantic segmentation using a large semantic segmentation model that supports open vocabulary, and project the obtained 2D semantic labels onto the corresponding single-frame point cloud, assigning an initial semantic label and confidence level to each point.
[0063] Specifically, by utilizing a large 2D semantic segmentation model that supports open vocabularies to process images and performing projection onto 3D point clouds, semantic transfer from rich 2D visual knowledge to 3D geometric space is achieved. This enables the system to automatically generate preliminary semantic annotations and confidence assessments for point clouds based solely on textual descriptions, bypassing the data preparation step of directly performing 3D annotation.
[0064] It is evident that shifting the source of point cloud semantic labels from relying on costly professional manual 3D annotation to utilizing the automated output of existing 2D vision models significantly reduces reliance on manual annotation and data costs, while providing an initial label foundation with semantic consistency and measurable confidence for subsequent multi-frame fusion.
[0065] 103. Guided by the pose data of the inertial navigation unit, the current frame point cloud is registered with the reference point cloud constructed by accumulating point clouds from multiple historical frames to obtain the pose of each frame in the global coordinate system.
[0066] Specifically, by utilizing the pose provided by the inertial navigation unit as an initial estimate, and registering the current frame point cloud with a reference point cloud constructed from accumulated and downsampled points across multiple historical frames, this approach changes the traditional frame-to-frame sequential registration strategy. This cumulative registration method ensures that each registration is based on a more stable and broader-coverage geometric reference.
[0067] As can be seen, this invention utilizes the rich geometric constraints provided by all past keyframes to correct the pose of the current frame, suppressing the tendency of registration errors to propagate and accumulate frame by frame in long sequence data, thereby improving the overall geometric consistency of multi-frame point cloud stitching in large-scale scenes and providing the necessary spatial alignment foundation for subsequent accurate multi-frame semantic fusion.
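The cumulative reference point cloud described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `voxel_downsample` and `build_reference_cloud`, the 0.2 m voxel size, and the choice of a per-voxel centroid as the downsampled point are all hypothetical. Keyframe poses are assumed to be 4x4 keyframe-to-global transforms.

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Replace all points falling in the same voxel by their centroid."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # guard against NumPy version differences
    sums = np.zeros((counts.shape[0], 3))
    np.add.at(sums, inverse, points)  # accumulate coordinates per voxel
    return sums / counts[:, None]

def build_reference_cloud(keyframe_clouds, keyframe_poses, voxel_size=0.2):
    """Move every keyframe into the global frame, stitch, then downsample.

    keyframe_clouds: list of (N_i, 3) arrays in each keyframe's own frame
    keyframe_poses:  list of (4, 4) keyframe-to-global transforms (assumed layout)
    """
    world = [pts @ T[:3, :3].T + T[:3, 3]
             for pts, T in zip(keyframe_clouds, keyframe_poses)]
    return voxel_downsample(np.vstack(world), voxel_size)

# Two keyframes observe the same spot; a third point lies far away.
T0 = np.eye(4)
T1 = np.eye(4)
T1[:3, 3] = [0.05, 0.0, 0.0]
cloud0 = np.array([[0.05, 0.05, 0.05], [5.1, 5.1, 5.1]])
cloud1 = np.array([[0.10, 0.10, 0.10]])
reference = build_reference_cloud([cloud0, cloud1], [T0, T1])
```

Here the two nearby observations collapse into one reference point, so the reference cloud stays bounded in density as more keyframes are accumulated.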
[0068] 104. Align all registered point clouds to the global coordinate system and perform voxelization. Within each voxel, perform fusion voting on the semantic information of points from different frames to determine the final semantics of each voxel.
[0069] Specifically, by dividing multi-frame point clouds into voxels under a unified global coordinate system, independent semantic decision units are created for each local region in space. Within each voxel, the semantic information and confidence of points from different times and perspectives are statistically analyzed and voted on, and the final semantic label of the voxel is determined according to preset rules.
[0070] It is evident that this multi-frame information fusion mechanism based on spatial voxels can effectively integrate local observations of occluded objects in different frames, utilize multi-view complementarity to solve the problem of incomplete single-frame annotation, and eliminate the uncertainty or error of single-frame annotation through statistical decision-making, thereby generating continuous, consistent and complete semantic segmentation results in three-dimensional space.
[0071] 105. The final semantic result obtained from the fusion voting is back-projected onto each original single-frame point cloud.
[0072] Specifically, the final semantic result determined by multi-frame fusion in the global coordinate system is remapped back to the data points of each frame point cloud in its original coordinates based on the frame index relationship at the time of original acquisition.
[0073] As can be seen, this invention's mechanism generates high-quality, globally consistent semantic maps while preserving the temporal information and point cloud structure of the original data stream. This allows the output to be used not only for offline mapping and planning but also to directly serve downstream tasks that require semantic information of the original frame sequence, such as real-time perception for autonomous driving and dynamic object tracking, thus expanding the system's application scope.
[0074] As an optional embodiment, the step described above, namely acquiring multi-sensor data, includes:
[0075] For a single-frame point cloud set acquired by multiple lidar sensors, the three-dimensional coordinates of each point are converted to homogeneous coordinates;
[0076] Using the calibrated extrinsic parameters from each lidar to the carrier body, the homogeneous point cloud is transformed into the carrier coordinate system;
[0077] The transformed point clouds from all lidars are merged to form a single-frame fused point cloud.
[0078] In this embodiment of the invention, by performing coordinate homogenization, extrinsic parameter transformation and merging processing on the original point clouds collected by multiple lidars, the precise alignment and geometric stitching of multi-source point cloud data in the carrier body coordinate system is achieved.
[0079] It is evident that unifying lidar observations from different locations and angles into the same spatial reference frame generates a geometrically consistent and more complete single-frame fused point cloud, providing an accurate and unified geometric data foundation for subsequent 2D semantic projection, cross-frame registration, and global map construction.
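The three operations of this embodiment (homogeneous coordinates, extrinsic transformation, merging) can be illustrated with a short sketch. The function names and the example extrinsics below are hypothetical; each extrinsic is assumed to be a 4x4 lidar-to-carrier transform.

```python
import numpy as np

def to_homogeneous(points_xyz):
    """Append a column of ones: (N, 3) -> (N, 4)."""
    ones = np.ones((points_xyz.shape[0], 1), dtype=points_xyz.dtype)
    return np.hstack([points_xyz, ones])

def fuse_lidar_frame(clouds, extrinsics):
    """Transform each lidar's points into the carrier frame and concatenate.

    clouds:     list of (N_i, 3) arrays, one per lidar, in sensor coordinates
    extrinsics: list of (4, 4) lidar-to-carrier transforms (assumed layout)
    """
    fused = []
    for pts, T in zip(clouds, extrinsics):
        pts_h = to_homogeneous(pts)            # (N_i, 4)
        pts_body = (T @ pts_h.T).T[:, :3]      # drop the homogeneous 1
        fused.append(pts_body)
    return np.vstack(fused)

# A front lidar mounted 2 m ahead and a rear lidar mounted 2 m behind,
# the rear one rotated 180 degrees about the vertical axis.
T_front = np.eye(4)
T_front[:3, 3] = [2.0, 0.0, 1.0]
T_rear = np.eye(4)
T_rear[:3, :3] = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
T_rear[:3, 3] = [-2.0, 0.0, 1.0]

front_cloud = np.array([[1.0, 0.0, 0.0]])
rear_cloud = np.array([[1.0, 0.0, 0.0]])
fused_cloud = fuse_lidar_frame([front_cloud, rear_cloud], [T_front, T_rear])
# fused_cloud is [[3, 0, 1], [-3, 0, 1]] in the carrier frame
```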
[0080] As an optional embodiment, the step described above, namely, performing 2D semantic segmentation on each frame of the camera image using a large semantic segmentation model that supports open vocabulary, projecting the segmented 2D semantic labels onto the corresponding single-frame point cloud, and assigning an initial semantic label and confidence level to each point, includes:
[0081] Using the SAM3 model, multi-class segmentation is performed on each camera image with text prompts to generate 2D semantic label maps and prediction confidence maps;
[0082] The semantic labels and confidence scores of the images, combined with camera intrinsic and extrinsic parameters, are projected onto the corresponding single-frame point cloud, and each point is assigned an initial semantic label, confidence score, and static prior.
[0083] In this embodiment of the invention, by employing a large 2D semantic segmentation model that supports open vocabulary to process images, and utilizing camera geometric parameters to project the segmentation results onto a 3D point cloud, a direct and automated semantic information transfer from two-dimensional image semantics to three-dimensional point cloud space is achieved. It assigns each 3D point a semantic category, prediction confidence, and category-based static or dynamic attribute priors from the two-dimensional visual model.
[0084] As can be seen, this invention bypasses the dependence on dedicated 3D annotation data and models, and can flexibly define the semantic categories to be identified through text prompts. It also provides initial semantic annotations with quantified confidence and dynamic and static priors for subsequent multi-frame point cloud processing in the form of projection, thus establishing the foundation for the association of multimodal perception information.
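The projection of 2D labels onto the point cloud can be sketched with a standard pinhole camera model. This is an illustration only: the 2D label map and confidence map are assumed to have been produced already by the segmentation model (model inference is not shown), the function name `project_labels`, the class ids, and the set of static classes are hypothetical, and occlusion handling is deliberately omitted.

```python
import numpy as np

def project_labels(points, T_cam_from_body, K, label_map, conf_map, static_classes):
    """Give each 3D point the label and confidence of the pixel it projects to.

    Returns (labels, confidences, static_prior); labels are -1 for points
    that fall behind the camera or outside the image.
    """
    n = points.shape[0]
    labels = np.full(n, -1, dtype=np.int64)
    confs = np.zeros(n, dtype=np.float64)

    # Carrier frame -> camera frame using the extrinsic parameters.
    pts_h = np.hstack([points, np.ones((n, 1))])
    cam = (T_cam_from_body @ pts_h.T).T[:, :3]
    in_front = cam[:, 2] > 1e-6

    # Camera frame -> pixel coordinates using the intrinsic matrix K.
    uvw = (K @ cam[in_front].T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    h, w = label_map.shape
    in_img = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    idx = np.flatnonzero(in_front)[in_img]
    labels[idx] = label_map[v[in_img], u[in_img]]
    confs[idx] = conf_map[v[in_img], u[in_img]]

    # Static prior derived from the class: True for classes assumed immobile.
    static_prior = np.isin(labels, list(static_classes))
    return labels, confs, static_prior

K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
label_map = np.zeros((4, 4), dtype=np.int64)
label_map[2, 2] = 1                      # hypothetical class id 1 = "road"
conf_map = np.full((4, 4), 0.5)
conf_map[2, 2] = 0.9
pts = np.array([[0.0, 0.0, 5.0],         # projects to pixel (2, 2)
                [0.0, 0.0, -1.0],        # behind the camera
                [1.0, 0.0, 1.0]])        # outside the image
labels, confs, static_prior = project_labels(
    pts, np.eye(4), K, label_map, conf_map, static_classes={1}
)
```

Points left with label -1 are the ones that later benefit from multi-frame fusion and neighbor propagation.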
[0085] As an optional embodiment, the step described above, which guides the registration of the current frame point cloud with a reference point cloud constructed from accumulated historical point clouds based on the inertial navigation unit pose data, includes:
[0086] Using the pose provided by the inertial navigation unit, calculate the initial transformation matrix of the current frame relative to the starting frame, as the initial value for registration;
[0087] Determine whether the current frame is a keyframe. The criteria for determining a keyframe include the displacement increment, rotation increment, or the point cloud overlap rate with the previous keyframe exceeding a preset threshold.
[0088] A cumulative reference point cloud is constructed by transforming the point clouds of all keyframes up to the current frame into the global coordinate system using their current optimal poses, and then performing voxel downsampling and stitching.
[0089] Using the initial transformation matrix as the initial value, the generalized iterative closest point (GICP) algorithm is used to register the current frame point cloud with the accumulated reference point cloud to obtain the optimized pose.
[0090] In this embodiment of the invention, the pose of the inertial navigation unit is used to provide an accurate initial transformation estimate for point cloud registration, and a stable reference point cloud is constructed and updated by adaptively selecting keyframes based on motion and overlap. Then, the generalized iterative closest point (GICP) algorithm is used to register the current frame against the accumulated reference point cloud.
[0091] As can be seen, this invention utilizes the geometric information of all past keyframes as the registration target, providing stronger spatial constraints for the current frame. This not only improves the convergence efficiency of single registration but also effectively suppresses the long-term drift problem caused by the successive accumulation of registration errors between frames, ensuring the accuracy of spatial alignment of long sequence point clouds in the global coordinate system.
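The initial transformation and the keyframe decision can be sketched as below. The thresholds (1 m, 10 degrees) and function names are hypothetical, poses are assumed to be 4x4 matrices in a common navigation frame, and the overlap-rate criterion and the registration step itself are left out for brevity.

```python
import numpy as np

def relative_to_start(T_world_start, T_world_cur):
    """Initial guess: pose of the current frame expressed in the starting frame."""
    return np.linalg.inv(T_world_start) @ T_world_cur

def rotation_angle(R):
    """Rotation angle in radians of a 3x3 rotation matrix."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.arccos(cos_theta))

def is_keyframe(T_last_kf, T_cur, max_trans=1.0, max_rot=np.deg2rad(10.0)):
    """Keyframe if displacement or rotation since the last keyframe is large."""
    delta = np.linalg.inv(T_last_kf) @ T_cur
    trans = float(np.linalg.norm(delta[:3, 3]))
    rot = rotation_angle(delta[:3, :3])
    return bool(trans > max_trans or rot > max_rot)

def yaw_pose(yaw_deg, xyz):
    """Helper: 4x4 pose from a yaw angle (degrees) and a translation."""
    c, s = np.cos(np.deg2rad(yaw_deg)), np.sin(np.deg2rad(yaw_deg))
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = xyz
    return T

T_start = yaw_pose(0.0, [10.0, 0.0, 0.0])
T_small = yaw_pose(0.0, [10.5, 0.0, 0.0])   # moved 0.5 m
T_far = yaw_pose(0.0, [12.0, 0.0, 0.0])     # moved 2 m
T_turn = yaw_pose(20.0, [10.0, 0.0, 0.0])   # turned 20 degrees in place

init_guess = relative_to_start(T_start, T_far)
decisions = [is_keyframe(T_start, T) for T in (T_small, T_far, T_turn)]
```

The resulting `init_guess` would then seed a GICP implementation of the reader's choice, registering the current frame against the accumulated reference cloud.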
[0092] As an optional embodiment, the step described above, aligning all registered point clouds to the global coordinate system and performing voxelization, and fusing and voting on the semantic information of points from different frames within each voxel, includes:
[0093] The registered multi-frame point cloud is stitched together and then divided into voxel spaces according to a preset resolution.
[0094] Within each voxel, static and dynamic points are distinguished based on semantic priors;
[0095] For static points, the confidence scores are summed per semantic category, and the category with the highest summed confidence is taken as the final semantic label of the voxel;
[0096] For the dynamic points, retain their best semantic labels from a single frame observation, or perform independent compensation and fusion based on their motion trajectory.
[0097] In this embodiment of the invention, multi-frame point clouds are divided into voxel grids in a unified coordinate system, and static and dynamic points are distinguished within each voxel unit based on semantic priors. For static points, a cross-frame voting mechanism based on confidence summation is used to determine the final semantics, while for dynamic points, a strategy of retaining the best observation or trajectory compensation is adopted for independent processing.
[0098] It is evident that this divide-and-conquer fusion mechanism can integrate multi-view observations to supplement the semantic information of occluded areas, improve the consistency and completeness of semantic annotation of static scenes through statistical decision-making, and at the same time avoid dynamic objects from interfering with static maps, thus achieving the separation of dynamic and static semantic information.
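The confidence-sum vote for static points can be sketched as follows. The function name `vote_voxels`, the voxel size, and the class ids are hypothetical, and dynamic points are simply skipped here rather than handled by best-observation retention or trajectory compensation.

```python
import numpy as np
from collections import defaultdict

def vote_voxels(points, labels, confs, is_static, voxel_size=0.2):
    """Per voxel, sum confidences per class over static points; keep the argmax."""
    scores = defaultdict(lambda: defaultdict(float))
    keys = np.floor(points / voxel_size).astype(np.int64).tolist()
    for key, lab, conf, static in zip(keys, labels, confs, is_static):
        if static and lab >= 0:               # ignore dynamic or unlabeled points
            scores[tuple(key)][int(lab)] += float(conf)
    return {key: max(per_class, key=per_class.get)
            for key, per_class in scores.items()}

# Four observations of the same voxel from different frames.
pts = np.array([[0.05, 0.05, 0.05],
                [0.06, 0.05, 0.05],
                [0.07, 0.05, 0.05],
                [0.08, 0.05, 0.05]])
labels = np.array([1, 1, 2, 3])
confs = np.array([0.4, 0.4, 0.7, 0.95])
is_static = np.array([True, True, True, False])   # last point is dynamic

voxel_labels = vote_voxels(pts, labels, confs, is_static)
# Class 1 wins with 0.8 against 0.7 for class 2, although class 2 has
# the single most confident static observation.
```

Summing confidences rather than taking the single most confident observation lets several moderately confident frames outvote one overconfident frame.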
[0099] As an optional embodiment, the step above, which involves backprojecting the final semantic result obtained from the fusion voting back to each original single-frame point cloud, includes:
[0100] Based on the recorded frame index, the semantic information after voxel fusion is back-allocated to the corresponding points of each original frame;
[0101] For points that still have no semantic labels after assignment, semantic labels are propagated from their neighboring labeled points within the single-frame point cloud using the K-nearest neighbor algorithm.
[0102] In this embodiment of the invention, the semantic information of the fused voxels is accurately transmitted back to the corresponding original data points based on the recorded indexes of each original point cloud frame. For points that still have no labels after transmission, labels are propagated within their respective single frames based on spatial proximity using the K-nearest neighbor algorithm, thus completing the closed-loop transmission of semantic information from the global fusion result to each original single frame.
[0103] As can be seen, this operation, based on the output of a globally consistent semantic map, assigns each original single-frame point cloud with a high-quality semantic label derived from multi-frame fusion. At the same time, it compensates for the projection blind spot through K-nearest neighbor propagation, ensuring the integrity of the single-frame semantic results. This allows the output to simultaneously meet the needs of offline mapping and downstream real-time perception tasks that require the original temporal point cloud semantics.
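Back-assignment by frame index followed by K-nearest-neighbor filling can be sketched as below. The names `back_assign` and `knn_fill`, the brute-force neighbor search, the majority vote among neighbors, and `k=3` are assumptions made for illustration; a real pipeline would likely use a spatial index.

```python
import numpy as np

def back_assign(points_world, voxel_labels, voxel_size=0.2):
    """Look up each stitched point's voxel label; -1 if the voxel has none."""
    keys = np.floor(points_world / voxel_size).astype(np.int64).tolist()
    return np.array([voxel_labels.get(tuple(k), -1) for k in keys], dtype=np.int64)

def knn_fill(points, labels, k=3):
    """Fill label -1 by majority vote among the k nearest labeled points."""
    out = labels.copy()
    known = np.flatnonzero(labels >= 0)
    if known.size == 0:
        return out
    for i in np.flatnonzero(labels < 0):
        dist = np.linalg.norm(points[known] - points[i], axis=1)
        nearest = known[np.argsort(dist)[:k]]
        values, counts = np.unique(labels[nearest], return_counts=True)
        out[i] = values[np.argmax(counts)]
    return out

# Stitched cloud with the frame index recorded for every point.
pts_world = np.array([[0.05, 0.05, 0.05],
                      [0.30, 0.05, 0.05],
                      [0.35, 0.05, 0.05],
                      [0.50, 0.05, 0.05],   # its voxel received no label
                      [0.06, 0.05, 0.05]])
frame_idx = np.array([0, 0, 0, 0, 1])
voxel_labels = {(0, 0, 0): 1, (1, 0, 0): 2}

fused = back_assign(pts_world, voxel_labels)
per_frame = {}
for f in np.unique(frame_idx):
    mask = frame_idx == f
    # Propagation happens inside each single frame, as in the text above.
    per_frame[int(f)] = knn_fill(pts_world[mask], fused[mask])
```

Because rigid transforms preserve distances, the neighbor search gives the same result whether it runs on global or original frame coordinates.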
[0104] As an optional embodiment, the method described above further includes:
[0105] Output a single-frame point cloud sequence with the final semantic label, a complete global semantic point cloud map, and a separated dynamic target point cloud set.
[0106] In this embodiment of the invention, by synchronously outputting single-frame point cloud sequences with temporal semantic tags, complete and coherent global semantic point cloud maps, and independently separated dynamic target point cloud sets, multiple data expression forms are provided to meet the needs of different downstream tasks.
[0107] As can be seen, the multi-format output design enables the results of this method to simultaneously serve tasks that require real-time sensing, such as simultaneous localization and mapping (SLAM); offline high-precision map building and path planning, which require global consistency; and tasks that perform dedicated analysis of dynamic environments, thereby enhancing the practicality of the entire technical solution in practical applications.
[0108] As an optional embodiment, the above steps, prior to the step of registering the current frame point cloud with a reference point cloud constructed from accumulated historical point clouds, further include:
[0109] Based on the static prior in the initial semantic labels, points identified as dynamic objects are filtered out from the current frame point cloud; or,
[0110] During the registration process, points whose registration residuals consistently exceed a threshold are marked and temporarily removed as dynamic points.
[0111] In this embodiment of the invention, by identifying and excluding point cloud data corresponding to dynamic objects based on semantic priors or geometric residuals before point cloud registration, it is ensured that the point clouds participating in the registration calculation mainly come from static environmental structures.
[0112] As can be seen, this invention reduces the interference of geometric inconsistencies caused by the movement of dynamic objects such as vehicles and pedestrians on the registration optimization process, allowing pose estimation to focus more on stable static scenes, thereby improving the accuracy of multi-frame point cloud spatial alignment and providing a more reliable geometric basis for generating accurate global semantic maps.
[0113] Furthermore, raw data is acquired: the drone platform is controlled to simultaneously collect multi-line LiDAR point clouds, surround-view camera images, and inertial navigation unit (INU) pose data.
[0114] The specific steps are as follows:
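[0115] 1. For the single-frame point cloud set acquired by the j-th LiDAR, $P^{(j)} = \{p_i^{(j)}\}$, where each point $p_i^{(j)} = (x_i, y_i, z_i)^\top$ represents its 3D coordinates in $F_{lidar_j}$;
[0116] homogenize it to $\tilde{p}_i^{(j)} = (x_i, y_i, z_i, 1)^\top$. Apply the extrinsic parameter transformation from the j-th LiDAR to the carrier body (denoted here as $T_{lidar_j}^{body}$): $\tilde{p}_i^{body} = T_{lidar_j}^{body}\, \tilde{p}_i^{(j)}$. Dehomogenization yields the point $p_i^{body}$ in the carrier coordinate system. The point clouds of all LiDARs are transformed in this way and then merged to form a single-frame fused point cloud: $P = \bigcup_j \{p_i^{body}\}_j$.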
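The homogenization, extrinsic transformation, and merging described in step 1 can be sketched as follows. This is a minimal illustration only: the function names, the row-major 4x4 matrix layout, and the example extrinsics are assumptions and do not come from the original disclosure.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Matrix4 = Sequence[Sequence[float]]  # row-major 4x4 homogeneous transform (assumed layout)


def transform_point(T: Matrix4, p: Point) -> Point:
    """Homogenize p, apply T, and de-homogenize the result."""
    hp = (p[0], p[1], p[2], 1.0)
    x, y, z, w = (sum(T[r][c] * hp[c] for c in range(4)) for r in range(4))
    return (x / w, y / w, z / w)


def fuse_lidars(clouds: List[List[Point]], extrinsics: List[Matrix4]) -> List[Point]:
    """Transform each LiDAR's points into the carrier frame and merge them."""
    fused: List[Point] = []
    for cloud, T in zip(clouds, extrinsics):
        fused.extend(transform_point(T, p) for p in cloud)
    return fused


# Hypothetical extrinsics: LiDAR 0 sits at the carrier origin, LiDAR 1 is shifted 1 m along x.
T0 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
T1 = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
fused = fuse_lidars([[(1.0, 2.0, 3.0)], [(0.0, 0.0, 0.0)]], [T0, T1])
```

In this example the origin of LiDAR 1 lands at x = 1 m in the carrier frame, while the point from LiDAR 0 is unchanged.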
[0117] 2. 2D Image Semantic Segmentation and 3D Projection Based on Large Models: This approach leverages the knowledge from large 2D models to provide initial semantic labels for 3D point clouds. Using open-vocabulary models such as SAM3, each camera image is segmented into multiple categories based on textual prompts, generating 2D semantic label maps and prediction confidence maps. The RGB colors, semantic labels, and confidence scores of the images are combined with camera intrinsic and extrinsic parameters and simultaneously projected onto the corresponding single-frame point cloud, assigning each point an initial color, semantic label, and confidence score (including static priors).
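The projection of 2D labels onto 3D points in step 2 can be illustrated with a simple pinhole model. The sketch below assumes that the points have already been transformed into the camera frame with the camera extrinsics, that lens distortion is ignored, and that the label and confidence maps are plain nested lists; the function name and parameters are hypothetical and are not part of the original disclosure.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def project_labels(points_cam: List[Point], fx: float, fy: float, cx: float, cy: float,
                   label_map: List[List[int]], conf_map: List[List[float]]
                   ) -> List[Optional[Tuple[int, float]]]:
    """Return (label, confidence) per point, or None if the point is not visible."""
    height, width = len(label_map), len(label_map[0])
    out: List[Optional[Tuple[int, float]]] = []
    for x, y, z in points_cam:
        if z <= 0.0:  # point lies behind the camera
            out.append(None)
            continue
        u = int(round(fx * x / z + cx))  # pixel column
        v = int(round(fy * y / z + cy))  # pixel row
        if 0 <= u < width and 0 <= v < height:
            out.append((label_map[v][u], conf_map[v][u]))
        else:  # projects outside the image
            out.append(None)
    return out


# Toy 2x2 "image" with hypothetical class ids and confidences.
label_map = [[1, 2], [3, 4]]
conf_map = [[0.9, 0.8], [0.7, 0.6]]
points_cam = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
labels = project_labels(points_cam, 1.0, 1.0, 0.0, 0.0, label_map, conf_map)
```

Points that return None here correspond to the camera blind spots that are later filled by KNN propagation.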
[0118] 3. INU-guided cumulative fine registration: this step precisely aligns the multi-frame point clouds to a unified coordinate system, creating the geometric conditions for fusion. Using the coarse pose provided by the INU, the initial transformation of each frame relative to the starting frame is calculated: the 6-DOF pose $T_{INU}^{i} \in SE(3)$ (including position and orientation) of each frame at time $t_i$ is obtained from the INU system; the relative transformation with respect to the starting frame $t_0$ is calculated as the initial guess: $T_{init}^{i} = (T_{INU}^{0})^{-1} \cdot T_{INU}^{i}$. This initial pose is used to guide the subsequent GICP, improving convergence speed and robustness (especially in low-texture or repetitive-structure scenes).
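The initial guess, i.e. the pose of frame i relative to the starting frame, can be computed as in the following minimal sketch. It assumes rigid 4x4 transforms stored as nested lists; the helper names are illustrative only.

```python
from typing import List

Matrix4 = List[List[float]]


def matmul4(A: Matrix4, B: Matrix4) -> Matrix4:
    """Multiply two 4x4 matrices."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)] for r in range(4)]


def invert_rigid(T: Matrix4) -> Matrix4:
    """Invert a rigid transform [R t; 0 1] as [R^T  -R^T t; 0 1]."""
    Rt = [[T[c][r] for c in range(3)] for r in range(3)]  # transpose of the rotation block
    t = [T[r][3] for r in range(3)]
    new_t = [-sum(Rt[r][k] * t[k] for k in range(3)) for r in range(3)]
    return [Rt[r] + [new_t[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def initial_guess(T_inu_0: Matrix4, T_inu_i: Matrix4) -> Matrix4:
    """Relative pose of frame i with respect to the starting frame."""
    return matmul4(invert_rigid(T_inu_0), T_inu_i)


# Hypothetical INU poses: pure translations of 1 m and 3 m along x.
T_inu_0 = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
T_inu_i = [[1.0, 0.0, 0.0, 3.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
T_init = initial_guess(T_inu_0, T_inu_i)
```

With these inputs the initial guess is a 2 m translation along x, which is the motion between the two INU poses.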
[0119] Keyframe selection and cumulative reference point cloud construction: To avoid the reference point cloud growing indefinitely, an adaptive keyframe mechanism is introduced.
[0120] 1) Keyframe determination conditions (triggered if any one of them is met): displacement increment > $d_{thr}$ (e.g., 0.5 m); rotation increment > $\theta_{thr}$ (e.g., 5°); point cloud overlap rate between the current frame and the nearest keyframe < $\eta_{thr}$ (e.g., 70%).
[0121] 2) Construction of the cumulative reference point cloud $R_{i-1}$: $R_{i-1} = \mathrm{VoxelDownsample}\left(\bigcup_{k \in K_{i-1}} P_k^{global},\ r_{ref}\right)$, where $K_{i-1}$ is the set of all keyframe indices up to the (i-1)-th frame; $P_k^{global}$ is the point cloud of the k-th frame transformed into the global coordinate system by its current optimal pose; and the voxel downsampling resolution is set to $r_{ref} = 0.1$ m to balance density and computational cost.
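The keyframe test and the voxel downsampling of the reference point cloud can be sketched as follows. The default thresholds are the example values given above. Representing each voxel by the centroid of its points and deriving the rotation angle from the trace of the rotation matrix are assumptions made for illustration, as are all function names.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def motion_from_relative(T_rel: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (translation norm in metres, rotation angle in degrees) of a rigid 4x4 transform."""
    displacement = math.sqrt(sum(T_rel[r][3] ** 2 for r in range(3)))
    trace = T_rel[0][0] + T_rel[1][1] + T_rel[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding error
    return displacement, math.degrees(math.acos(cos_angle))


def is_keyframe(displacement: float, rotation_deg: float, overlap: float,
                d_thr: float = 0.5, theta_thr: float = 5.0, eta_thr: float = 0.7) -> bool:
    """Any single condition is enough to trigger a new keyframe."""
    return displacement > d_thr or rotation_deg > theta_thr or overlap < eta_thr


def voxel_downsample(points: List[Point], r_ref: float = 0.1) -> List[Point]:
    """Replace all points that fall into the same voxel by their centroid."""
    cells: Dict[Tuple[int, int, int], List[Point]] = {}
    for p in points:
        key = (math.floor(p[0] / r_ref), math.floor(p[1] / r_ref), math.floor(p[2] / r_ref))
        cells.setdefault(key, []).append(p)
    return [(sum(q[0] for q in c) / len(c), sum(q[1] for q in c) / len(c),
             sum(q[2] for q in c) / len(c)) for c in cells.values()]


# Hypothetical relative motion since the last keyframe: 0.6 m along x, no rotation.
T_rel = [[1.0, 0.0, 0.0, 0.6], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
disp, rot = motion_from_relative(T_rel)
new_kf = is_keyframe(disp, rot, overlap=0.9)
reference = voxel_downsample([(0.01, 0.01, 0.01), (0.03, 0.03, 0.03), (0.55, 0.55, 0.55)])
```

Here the displacement alone triggers a keyframe, and the two nearby points collapse into a single reference point.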
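[0122] Cumulative GICP fine registration: Using $T_{init}^{i}$ as the initial value, perform generalized ICP (GICP) optimization (written here in the standard GICP form):
[0123] $T_{opt}^{i} = \arg\min_{T} \sum_{p \in P_i} d_p^\top \left( \Sigma_{\pi_R(Tp)} + T\, \Sigma_p\, T^\top \right)^{-1} d_p$, with $d_p = \pi_R(Tp) - Tp$, where $\pi_R(\cdot)$ represents the nearest neighbor projection onto the reference point cloud R, and the covariance $\Sigma$ is estimated by local surface fitting; update the global pose: $T_{global}^{i} = T_{opt}^{i} \cdot T_{global}^{0}$, where $T_{global}^{0} = I$ if $t_0$ is the origin.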
[0124] d. Dynamic point removal: To improve registration accuracy, suspected dynamic points can be removed before registration: using the static prior generated in step 2 to filter out dynamic label points; or based on point cloud residuals: if a point has a consistently large residual in multiple registrations, it is marked as dynamic and temporarily removed.
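Both removal strategies can be sketched in a few lines. The rule used below for a "consistently large" residual (at least `min_hits` registrations above the threshold) is an assumption, as are the function names and the per-point residual history layout; the original text does not specify them.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def filter_by_static_prior(points: Sequence[Point], static_mask: Sequence[bool]) -> List[Point]:
    """Keep only points whose semantic prior marks them as static."""
    return [p for p, is_static in zip(points, static_mask) if is_static]


def flag_dynamic_points(residual_history: List[List[float]], threshold: float,
                        min_hits: int) -> List[bool]:
    """residual_history[p] holds the residuals of point p over several registrations."""
    return [sum(1 for r in history if r > threshold) >= min_hits
            for history in residual_history]


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
static_points = filter_by_static_prior(points, [True, False, True])

history = [[0.02, 0.03, 0.01], [0.40, 0.55, 0.60], [0.01, 0.50, 0.02]]
dynamic = flag_dynamic_points(history, threshold=0.3, min_hits=2)
```

The third point exceeds the threshold only once and is therefore not flagged, which illustrates why a single large residual should not be enough to discard a point.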
[0125] 4. Voxelized multi-frame semantic fusion: All registered point clouds are stitched together and divided into voxels. The voxel size can be configured between 0.05 m and 0.2 m according to the application scenario; the default is 0.1 m, chosen to balance semantic fusion accuracy and computational efficiency.
[0126] The key innovative step within each voxel is to first identify and separate static points from dynamic points (based on semantic priors).
[0127] Static point fusion: Only static points participate in the vote, which uses a "confidence score summation" method to determine the final semantic label of each voxel. The semantic confidence scores of the points in the voxel are summed per category, and the summed scores together with the background category priority are used in a multi-level ranking process; the category with the highest rank is selected as the final semantic label of the voxel. In the background category priority, artificial infrastructure (buildings, garbage dumps) takes precedence over basic background categories (roads, vegetation, lakes). The fused color is calculated using a method such as the median. This is the core fusion process.
[0128] Dynamic point processing: Dynamic points do not participate in the static map voting. Their labels and colors can retain their best single-frame observation values, or be independently compensated and fused on their motion trajectories.
[0129] 5. Back projection and single-frame label optimization: Based on the recorded frame index, the semantic and color information after voxel fusion is back-allocated to the point cloud of each original frame.
[0130] For points that are still unlabeled after assignment (such as those in the blind spots of all cameras), the K-Nearest Neighbors (KNN) algorithm is used within the single frame to propagate semantic labels from neighboring points with existing labels.
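The back-allocation by recorded frame index can be sketched as a simple lookup. The record layout (frame index, point index, voxel key) is an assumption made for illustration; points whose voxel has no fused label stay unlabeled (None) and are left for the KNN propagation step.

```python
from typing import Dict, List, Optional, Tuple

VoxelKey = Tuple[int, int, int]
Record = Tuple[int, int, VoxelKey]  # (frame index, point index within frame, voxel key)


def back_project(records: List[Record], voxel_labels: Dict[VoxelKey, str],
                 frame_sizes: List[int]) -> List[List[Optional[str]]]:
    """Write each voxel's fused label back to the original points of every frame."""
    labels: List[List[Optional[str]]] = [[None] * n for n in frame_sizes]
    for frame_idx, point_idx, key in records:
        labels[frame_idx][point_idx] = voxel_labels.get(key)  # None if the voxel is unlabeled
    return labels


# Hypothetical data: two frames, one fused voxel labeled "road".
records = [(0, 0, (0, 0, 0)), (0, 1, (5, 5, 5)), (1, 0, (0, 0, 0))]
voxel_labels = {(0, 0, 0): "road"}
per_frame = back_project(records, voxel_labels, frame_sizes=[2, 1])
```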
[0131] 6. Multi-format output provides outputs that adapt to different application scenarios, including single-frame point cloud sequences with semantic tags (for SLAM and dynamic analysis), complete global semantic point cloud maps (for mapping and planning), and separate dynamic target point cloud sets.
[0132] It should be noted that the goal of cumulative GICP precise registration is to register the point cloud Pi of the i-th frame to the reference point cloud Ri-1 constructed from the previous i-1 frames, rather than registering it only with Pi-1. The following is a code snippet for this step:
[0133] Input: LiDAR frame sequence {P_0, P_1, ..., P_{n−1}}, INU poses {T_inu^0, ..., T_inu^{n−1}}
[0134] Output: Globally optimized poses {T_global^0, ..., T_global^{n−1}}
[0135] T_global^0 ← I
[0136] keyframes ← [0]
[0137] R ← VoxelDownsample(P_0, r=0.1) // Cumulative reference point cloud
[0138] for i = 1 to n−1 do
[0139]     // INU initialization guidance
[0140]     T_init ← (T_inu^0)⁻¹ ⋅ T_inu^i
[0142]     // (Optional) Remove dynamic points
[0143]     P_i_static ← FilterDynamic(P_i, static_mask_i)
[0145]     // Cumulative GICP registration
[0146]     T_opt ← GICP(source=P_i_static, target=R, init=T_init)
[0147]     T_global^i ← T_opt
[0149]     // Global point cloud (for fusion)
[0150]     P_i^global ← T_global^i ⋅ P_i
[0152]     // Keyframe detection
[0153]     if MotionOrLowOverlap(T_global^i, T_global^{last_kf}) then
[0154]         keyframes.append(i)
[0155]         R ← VoxelDownsample(R ∪ P_i^global, r=0.1)
[0156]     end if
[0158]     // (Optional) Limit the history length with a sliding window
[0159]     if |keyframes| > K_max then
[0160]         Rebuild R from recent keyframes
[0161]     end if
[0162] end for
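The control flow of the pseudocode above can be sketched in Python as follows. This is not a GICP implementation: the registration, keyframe test, and downsampling are passed in as callables, and the example uses trivial stand-ins (the "registration" simply returns the initial guess). All names are illustrative assumptions.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
Cloud = List[Point]
Matrix4 = List[List[float]]

IDENTITY: Matrix4 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


def apply_transform(T: Matrix4, cloud: Cloud) -> Cloud:
    """Apply a rigid 4x4 transform to every point of a cloud."""
    return [tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))
            for x, y, z in cloud]


def cumulative_registration(
    frames: List[Cloud],
    init_guesses: List[Matrix4],
    register: Callable[[Cloud, Cloud, Matrix4], Matrix4],
    is_keyframe: Callable[[Matrix4, Matrix4], bool],
    downsample: Callable[[Cloud], Cloud],
    k_max: int = 50,
) -> Tuple[List[Matrix4], Cloud]:
    poses = [IDENTITY]
    keyframe_clouds = [frames[0]]
    last_kf_pose = IDENTITY
    reference = downsample(frames[0])
    for i in range(1, len(frames)):
        # Register frame i against the accumulated reference, not against frame i-1.
        T_opt = register(frames[i], reference, init_guesses[i])
        poses.append(T_opt)
        cloud_global = apply_transform(T_opt, frames[i])
        if is_keyframe(T_opt, last_kf_pose):
            keyframe_clouds.append(cloud_global)
            last_kf_pose = T_opt
            reference = downsample(reference + cloud_global)
        if len(keyframe_clouds) > k_max:
            # Sliding window: rebuild the reference from the most recent keyframes only.
            keyframe_clouds = keyframe_clouds[-k_max:]
            reference = downsample([p for kf in keyframe_clouds for p in kf])
    return poses, reference


def shift_x(dx: float) -> Matrix4:
    """Hypothetical helper: pure translation along x."""
    T = [row[:] for row in IDENTITY]
    T[0][3] = dx
    return T


# Stand-ins for illustration only: the "registration" simply trusts the initial guess.
frames = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
guesses = [IDENTITY, shift_x(1.0), shift_x(2.0)]
poses, reference = cumulative_registration(
    frames, guesses,
    register=lambda src, ref, init: init,
    is_keyframe=lambda pose, last: True,
    downsample=lambda cloud: list(cloud),
    k_max=2,
)
```

With `k_max=2`, the reference after the third frame contains only the two most recent keyframes, which shows how the sliding window bounds the size of the reference point cloud.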
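[0163] Mathematical form (GICP loss, written here in the standard GICP form): $E(T) = \sum_{p \in P_i} d_p^\top \left( \Sigma_{\pi(Tp)} + T\, \Sigma_p\, T^\top \right)^{-1} d_p$, with $d_p = \pi(Tp) - Tp$,
[0164] where π(·) is the nearest neighbor projection, and Σ is the point covariance (estimated from the local surface).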
[0165] Intra-voxel multi-frame voting fusion mechanism:
[0166] Input: All registered point clouds {P_i^global}, where each point carries a semantic label s ∈ C (the category set), a confidence score c ∈ [0,1], and a staticity flag δ ∈ {0,1} (1 = static).
[0167] Input: A global point cloud set {P_i} and its attributes (location, semantic label, confidence score, color, static flags).
[0168] Output: The final semantic label and color for each voxel. The following is a code snippet for this step:
[0169] GlobalPoints ← ∅
[0170] for each frame i do
[0171]     P_i_global ← Transform(P_i, T_global[i])
[0172]     AttachAttributes(P_i_global, semantic_label_i, confidence_i, rgb_i, static_flag_i)
[0173]     GlobalPoints ← GlobalPoints ∪ P_i_global
[0174] end for
[0175] Initialize the voxel grid (resolution = 0.1 m)
[0176] for each point p ∈ GlobalPoints do
[0177]     if p.static_flag == 1 then
[0178]         voxel_key ← GetVoxelKey(p.position)
[0179]         VoxelGrid[voxel_key].AddPoint(
[0180]             label = p.semantic_label,
[0181]             confidence = p.confidence,
[0182]             color = p.rgb)
[0184]     end if
[0185] end for
[0186] for each voxel v ∈ VoxelGrid do
[0187]     if v is empty then continue
[0189]     // Calculate the total confidence score for each category
[0190]     class_score ← empty mapping
[0191]     for all possible categories c do
[0192]         class_score[c] ← Σ{ p.confidence | p ∈ v.points and p.label == c }
[0193]     end for
[0195]     // Apply manual priority rules (example)
[0196]     if |class_score["building"] − class_score["vegetation"]| < τ then
[0197]         class_score["building"] ← class_score["building"] + β // Increase the weight of buildings
[0198]     end if
[0200]     final_label ← argmax_c(class_score[c])
[0202]     // Blend colors (median, to resist outliers)
[0203]     color_list ← { p.color | p ∈ v.points and p.label == final_label }
[0204]     final_color ← Median(color_list)
[0206]     v.set_label(final_label)
[0207]     v.set_color(final_color)
[0208] end for
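The voting procedure above can be sketched in Python as follows. Points are represented as plain dictionaries, the values chosen for τ and β are arbitrary, and the priority rule is applied only when both categories are present in the voxel; these are assumptions made for illustration rather than details taken from the original text.

```python
import math
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple


def fuse_static_voxels(points: List[dict], voxel_size: float = 0.1,
                       tau: float = 0.1, beta: float = 0.2) -> Dict[tuple, Tuple[str, tuple]]:
    """Return {voxel_key: (final_label, final_rgb)} using confidence-sum voting."""
    grid = defaultdict(list)
    for p in points:
        if p["static"]:  # dynamic points never vote on the static map
            key = tuple(math.floor(c / voxel_size) for c in p["position"])
            grid[key].append(p)

    fused = {}
    for key, members in grid.items():
        score = defaultdict(float)
        for p in members:
            score[p["label"]] += p["confidence"]
        # Example priority rule: when building and vegetation are nearly tied, favour building.
        if ("building" in score and "vegetation" in score
                and abs(score["building"] - score["vegetation"]) < tau):
            score["building"] += beta
        final_label = max(score, key=score.get)
        colors = [p["rgb"] for p in members if p["label"] == final_label]
        final_rgb = tuple(median(channel) for channel in zip(*colors))  # per-channel median
        fused[key] = (final_label, final_rgb)
    return fused


points = [
    {"position": (0.01, 0.02, 0.03), "label": "building", "confidence": 0.50,
     "rgb": (100, 100, 100), "static": True},
    {"position": (0.04, 0.05, 0.06), "label": "vegetation", "confidence": 0.55,
     "rgb": (0, 200, 0), "static": True},
    {"position": (0.02, 0.02, 0.02), "label": "car", "confidence": 0.99,
     "rgb": (255, 0, 0), "static": False},
]
fused = fuse_static_voxels(points)
```

In this example the dynamic "car" point is ignored despite its high confidence, and the priority rule tips the near-tie between "building" and "vegetation" towards "building".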
[0209] The pseudocode for KNN label propagation is:
[0210] Input:
[0211] - Labeled point set S = {((x_j, y_j, z_j), label_j)}
[0212] - Unlabeled point set U = {(x_k, y_k, z_k)}
[0213] - Parameters k (number of nearest neighbors), d_max (maximum effective distance)
[0214] Output: Propagated label of each point in U
[0215] Index ← BuildKDTree(S.positions) // Construct the kd-tree index
[0216] for each unlabeled point u ∈ U do
[0217]     neighbors ← Index.KNearestNeighbors(u.position, k=k)
[0218]     valid_neighbors ← { n ∈ neighbors | distance(n, u) ≤ d_max }
[0220]     if valid_neighbors is empty then
[0221]         u.label ← 0 // Background / Unknown
[0222]     else
[0223]         // Majority vote (ignore background class 0)
[0224]         vote_count ← number of times each label appears in valid_neighbors
[0225]         vote_count[0] ← −1 // Exclude the background class from the decision
[0226]         u.label ← argmax_label(vote_count[label])
[0227]     end if
[0228] end for
[0229] return the list of labels for U
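The propagation rule can be sketched in Python as follows. For brevity this sketch uses a brute-force neighbor search instead of a kd-tree, which yields the same labels but scales poorly; a spatial index would be used for real point clouds. The function name and default parameters are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, float, float]


def knn_propagate(labeled: List[Tuple[Point, int]], unlabeled: List[Point],
                  k: int = 5, d_max: float = 0.5, background: int = 0) -> List[int]:
    """Assign each unlabeled point the majority label of its k nearest labeled neighbors."""
    result: List[int] = []
    for u in unlabeled:
        # Brute-force search: sort all labeled points by distance to u.
        nearest = sorted((math.dist(u, pos), label) for pos, label in labeled)[:k]
        # Keep neighbors within d_max and exclude the background class from the vote.
        votes = [label for dist, label in nearest if dist <= d_max and label != background]
        result.append(Counter(votes).most_common(1)[0][0] if votes else background)
    return result


labeled = [((0.0, 0.0, 0.0), 1), ((0.1, 0.0, 0.0), 1), ((0.2, 0.0, 0.0), 2), ((5.0, 5.0, 5.0), 3)]
unlabeled = [(0.05, 0.0, 0.0), (10.0, 10.0, 10.0)]
propagated = knn_propagate(labeled, unlabeled, k=3, d_max=0.5)
```

The second query point has no labeled neighbor within `d_max`, so it falls back to the background class, as in the pseudocode.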
[0230] Embodiment 2
[0231] Please refer to Figure 2, which is a schematic diagram of the structure of a multi-frame point cloud pre-segmentation system based on a large semantic segmentation model disclosed in an embodiment of the present invention. The system described in Figure 2 can be applied to data processing chips, processing terminals, or processing servers. The processing server can be a local server or a cloud server; this embodiment of the invention does not impose any limitations. As shown in Figure 2, the multi-frame point cloud pre-segmentation system based on a large semantic segmentation model can include the following modules:
[0232] The data acquisition module 201 is used to acquire multi-sensor data, which includes at least multiple frames of lidar point clouds, camera images synchronized with each frame of point cloud, and pose data provided by the inertial navigation unit.
[0233] Specifically, this step ensures the spatiotemporal consistency of multi-source sensing data by simultaneously acquiring multiple frames of LiDAR point clouds, time-corresponding camera images, and pose data from the inertial navigation unit (INU). This synchronization mechanism provides a precise data foundation for the subsequent accurate projection of 2D image semantics onto the 3D point cloud and for initial registration guidance using INU poses.
[0234] As can be seen, this invention ensures the accurate alignment of multimodal information from the data source, providing a reliable and consistent input for the subsequent implementation of the complete technical process of "multi-frame point cloud semantic segmentation based on 2D large model", thereby supporting high-precision point cloud registration, effective cross-modal semantic association and the construction of the final global semantic map.
[0235] The semantic segmentation projection module 202 is used to perform 2D semantic segmentation on each frame of the camera image using a large semantic segmentation model that supports open vocabulary, and to project the segmented 2D semantic labels onto the corresponding single-frame point cloud, assigning an initial semantic label and confidence level to each point.
[0236] Specifically, by utilizing a large 2D semantic segmentation model that supports open vocabularies to process images and performing projection onto 3D point clouds, semantic transfer from rich 2D visual knowledge to 3D geometric space is achieved. This enables the system to automatically generate preliminary semantic annotations and confidence assessments for point clouds based solely on textual descriptions, bypassing the data preparation step of directly performing 3D annotation.
[0237] It is evident that shifting the source of point cloud semantic labels from relying on costly professional manual 3D annotation to utilizing the automated output of existing 2D vision models significantly reduces reliance on manual annotation and data costs, while providing an initial label foundation with semantic consistency and measurable confidence for subsequent multi-frame fusion.
[0238] The cumulative registration module 203 is used to register the current frame point cloud with a reference point cloud constructed by accumulating point clouds from multiple historical frames based on the pose data of the inertial navigation unit, so as to obtain the pose of each frame in the global coordinate system.
[0239] Specifically, by utilizing the pose provided by the inertial navigation unit as an initial estimate, and registering the current frame point cloud with a reference point cloud constructed from accumulated and downsampled points across multiple historical frames, this approach changes the traditional frame-to-frame sequential registration strategy. This cumulative registration method ensures that each registration is based on a more stable and broader-coverage geometric reference.
[0240] As can be seen, this invention utilizes the rich geometric constraints provided by all past keyframes to correct the pose of the current frame, suppressing the tendency of registration errors to propagate and accumulate frame by frame in long sequence data, thereby improving the overall geometric consistency of multi-frame point cloud stitching in large-scale scenes and providing the necessary spatial alignment foundation for subsequent accurate multi-frame semantic fusion.
[0241] The voxel fusion module 204 is used to align all registered point clouds to the global coordinate system and perform voxelization. Within each voxel, the semantic information of points from different frames is fused and voted to determine the final semantics of each voxel.
[0242] Specifically, by dividing multi-frame point clouds into voxels under a unified global coordinate system, independent semantic decision units are created for each local region in space. Within each voxel, the semantic information and confidence of points from different times and perspectives are statistically analyzed and voted on, and the final semantic label of the voxel is determined according to preset rules.
[0243] It is evident that this multi-frame information fusion mechanism based on spatial voxels can effectively integrate local observations of occluded objects in different frames, utilize multi-view complementarity to solve the problem of incomplete single-frame annotation, and eliminate the uncertainty or error of single-frame annotation through statistical decision-making, thereby generating continuous, consistent and complete semantic segmentation results in three-dimensional space.
[0244] The back-projection module 205 is used to back-project the final semantic result obtained by the voxel fusion module back to each original single-frame point cloud.
[0245] Specifically, the final semantic result determined by multi-frame fusion in the global coordinate system is remapped back to the data points of each frame point cloud in its original coordinates based on the frame index relationship at the time of original acquisition.
[0246] As can be seen, this invention's mechanism generates high-quality, globally consistent semantic maps while preserving the temporal information and point cloud structure of the original data stream. This allows the output to be used not only for offline mapping and planning but also to directly serve downstream tasks that require semantic information of the original frame sequence, such as real-time perception for autonomous driving and dynamic object tracking, thus expanding the system's application scope.
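The chaining of modules 201 to 205 can be pictured with the following skeleton. It only illustrates the data flow between the modules; the class name, the callable-based design, and the stand-in stages are assumptions and do not describe the actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PreSegmentationSystem:
    acquire: Callable[[], Any]                 # data acquisition module 201
    segment_and_project: Callable[[Any], Any]  # semantic segmentation projection module 202
    register: Callable[[Any], Any]             # cumulative registration module 203
    fuse: Callable[[Any], Any]                 # voxel fusion module 204
    back_project: Callable[[Any], Any]         # back-projection module 205

    def run(self) -> Any:
        """Pass the output of each module to the next one, in order."""
        data = self.acquire()
        for stage in (self.segment_and_project, self.register, self.fuse, self.back_project):
            data = stage(data)
        return data


# Stand-in stages that only record the order in which they were called.
system = PreSegmentationSystem(
    acquire=lambda: ["acquired"],
    segment_and_project=lambda d: d + ["labeled"],
    register=lambda d: d + ["registered"],
    fuse=lambda d: d + ["fused"],
    back_project=lambda d: d + ["back-projected"],
)
trace = system.run()
```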
[0247] Embodiment 3
[0248] Please refer to Figure 3, which is a schematic diagram of another multi-frame point cloud pre-segmentation system based on a large semantic segmentation model disclosed in an embodiment of the present invention. As shown in Figure 3, the device may include:
[0249] Memory 301 storing executable program code;
[0250] Processor 302 coupled to memory 301;
[0251] The processor 302 calls the executable program code stored in the memory 301 to execute some or all of the steps in the multi-frame point cloud pre-segmentation method based on a large semantic segmentation model disclosed in Embodiment 1 of the present invention.
[0252] Embodiment 4
[0253] This invention discloses a computer storage medium storing computer instructions. When these computer instructions are invoked, they are used to execute some or all of the steps in the multi-frame point cloud pre-segmentation method based on a large semantic segmentation model disclosed in Embodiment 1 of this invention.
[0254] Embodiment 5
[0255] This invention discloses a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform the steps of a multi-frame point cloud pre-segmentation method based on a large semantic segmentation model as described in Embodiment 1.
[0256] The system embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0257] Through the detailed description of the above embodiments, those skilled in the art can clearly understand that each implementation method can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically-erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
[0258] Finally, it should be noted that the multi-frame point cloud pre-segmentation method and system based on a large semantic segmentation model disclosed in the embodiments of the present invention are merely preferred embodiments of the present invention and are only used to illustrate the technical solutions of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A multi-frame point cloud pre-segmentation method based on a large semantic segmentation model, characterized in that the method includes: collecting multi-sensor data, which includes at least multiple frames of LiDAR point clouds, camera images time-synchronized with each frame of point cloud, and pose data provided by the inertial navigation unit; for each frame of the camera image, performing 2D semantic segmentation using a large semantic segmentation model that supports open vocabulary, projecting the 2D semantic labels obtained from the segmentation onto the corresponding single-frame point cloud, and assigning each point an initial semantic label and confidence level; guided by the pose data of the inertial navigation unit, registering the current frame point cloud with the reference point cloud constructed by accumulating point clouds from multiple historical frames, to obtain the pose of each frame in the global coordinate system; aligning all registered point clouds to the global coordinate system and performing voxelization, and, within each voxel, fusing and voting on the semantic information of points from different frames to determine the final semantics of each voxel; and back-projecting the final semantic result obtained from the fusion voting onto each original single-frame point cloud.
2. The multi-frame point cloud pre-segmentation method based on a large semantic segmentation model according to claim 1, characterized in that collecting the multi-sensor data includes: for the single-frame point cloud sets acquired by multiple LiDAR sensors, homogenizing the three-dimensional coordinates of each point; using the calibrated extrinsic parameters of each LiDAR relative to the carrier body, transforming the homogenized point cloud to the carrier coordinate system; and merging the transformed point clouds of all LiDARs to form a single-frame fused point cloud.
3. The multi-frame point cloud pre-segmentation method based on a large semantic segmentation model according to claim 1, characterized in that the step of performing 2D semantic segmentation on each frame of the camera image using a large semantic segmentation model that supports open vocabulary, projecting the 2D semantic labels obtained from the segmentation onto the corresponding single-frame point cloud, and assigning each point an initial semantic label and confidence level includes: using the SAM3 model, performing multi-class segmentation on each camera image with text prompts to generate 2D semantic label maps and prediction confidence maps; and projecting the semantic labels and confidence scores of the images, combined with the camera intrinsic and extrinsic parameters, onto the corresponding single-frame point cloud, so that each point is assigned an initial semantic label, a confidence score, and a static prior.
4. The multi-frame point cloud pre-segmentation method based on a large semantic segmentation model according to claim 1, characterized in that the step of registering, guided by the pose data of the inertial navigation unit, the current frame point cloud with a reference point cloud constructed from accumulated historical point clouds includes: using the pose provided by the inertial navigation unit, calculating the initial transformation matrix of the current frame relative to the starting frame as the initial value for registration; determining whether the current frame is a keyframe, where the criteria for determining a keyframe include the displacement increment or the rotation increment exceeding a preset threshold, or the point cloud overlap rate with the previous keyframe falling below a preset threshold; constructing a cumulative reference point cloud, which is formed by transforming the point clouds of all keyframes up to the current frame into the global coordinate system using their current optimal poses, and then performing stitching and voxel downsampling; and, using the initial transformation matrix as the initial value, registering the current frame point cloud with the cumulative reference point cloud by means of the generalized iterative closest point (GICP) algorithm to obtain the optimized pose.
5. The multi-frame point cloud pre-segmentation method based on a large semantic segmentation model according to claim 1, characterized in that, The step of aligning all registered point clouds to the global coordinate system and performing voxelization, and then fusing and voting on the semantic information of points from different frames within each voxel, includes: The registered multi-frame point cloud is stitched together and then divided into voxel spaces according to a preset resolution. Within each voxel, static and dynamic points are distinguished based on semantic priors; For the static point, the sum of the confidence scores of its different semantic categories is calculated, and the category with the highest sum of confidence scores is determined as the final semantic label of the voxel; For the dynamic points, retain their best semantic labels from a single frame observation, or perform independent compensation and fusion based on their motion trajectory.
6. The multi-frame point cloud pre-segmentation method based on a large semantic segmentation model according to claim 1, characterized in that, The step of back-projecting the final semantic result obtained from the fusion voting back to each original single-frame point cloud includes: Based on the recorded frame index, the semantic information after voxel fusion is back-allocated to the corresponding points of each original frame; For points that still have no semantic labels after assignment, semantic labels are propagated from their neighboring labeled points within the single-frame point cloud using the K-nearest neighbor algorithm.
7. The multi-frame point cloud pre-segmentation method based on a large semantic segmentation model according to claim 1, characterized in that, The method further includes: Output a single-frame point cloud sequence with the final semantic label, a complete global semantic point cloud map, and a separated dynamic target point cloud set.
8. The multi-frame point cloud pre-segmentation method based on a large semantic segmentation model according to claim 1, characterized in that, Before the step of registering the current frame point cloud with a reference point cloud constructed from accumulated point clouds from multiple historical frames, the method further includes: Based on the static prior in the initial semantic labels, points identified as dynamic objects are filtered out from the current frame point cloud; or, During the registration process, points whose registration residuals consistently exceed a threshold are marked and temporarily removed as dynamic points.
9. A multi-frame point cloud pre-segmentation system based on a large semantic segmentation model, used to implement the multi-frame point cloud pre-segmentation method based on a large semantic segmentation model as described in any one of claims 1-8, characterized in that, The system includes: The data acquisition module is used to acquire multi-sensor data, which includes at least multiple frames of lidar point cloud, camera images synchronized with each frame of point cloud, and pose data provided by the inertial navigation unit. The semantic segmentation projection module is used to perform 2D semantic segmentation on each frame of the camera image using a large semantic segmentation model that supports open vocabulary, and to project the segmented 2D semantic labels onto the corresponding single-frame point cloud, assigning an initial semantic label and confidence score to each point; The cumulative registration module is used to register the current frame point cloud with a reference point cloud constructed by accumulating point clouds from multiple historical frames, based on the pose data of the inertial navigation unit, so as to obtain the pose of each frame in the global coordinate system. The voxel fusion module is used to align all registered point clouds to the global coordinate system and perform voxelization. Within each voxel, the semantic information of points from different frames is fused and voted to determine the final semantics of each voxel. The back-projection module is used to back-project the final semantic result obtained by the voxel fusion module back to each original single-frame point cloud.
10. A multi-frame point cloud pre-segmentation system based on a large semantic segmentation model, characterized in that, The system includes: Memory containing executable program code; A processor coupled to the memory; The processor calls the executable program code stored in the memory to execute a multi-frame point cloud pre-segmentation method based on a large semantic segmentation model as described in any one of claims 1-8.