A point cloud segmentation method based on multi-scale local feature aggregation and global feature refinement
By constructing a multi-scale local neighborhood structure and a global feature refinement method, combined with an adaptive feature fusion strategy, the problems of insufficient multi-scale target representation ability and high computational complexity in point cloud semantic segmentation are solved, achieving efficient and accurate point cloud segmentation.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Application (China)
- Current Assignee / Owner: ANHUI POLYTECHNIC UNIV
- Filing Date: 2026-03-31
- Publication Date: 2026-06-26
Figure CN122289686A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of 3D point cloud data processing and computer vision, and in particular to a point cloud segmentation method based on multi-scale local feature aggregation and global feature refinement.
Background Technology
[0002] 3D point cloud semantic segmentation is one of the key technologies in 3D scene understanding. Its goal is to assign a corresponding semantic category label to each point in the point cloud, which has important application value in fields such as autonomous driving, robot perception, and virtual reality. Unlike 2D images, 3D point cloud data is unstructured, sparsely distributed, and spatially non-uniform, making it difficult to directly apply traditional image processing methods based on regular grids.
[0003] Existing deep learning-based point cloud semantic segmentation methods typically model local geometric structure information by constructing local neighborhoods and aggregating neighborhood features. For example, the published document with application publication number CN119785032A, publication date 2025-04-08, and patent title "A Point Cloud Semantic Segmentation Method and System Based on Local Neighborhood Attention" discloses a method including the following steps: S101, dividing 3D point cloud data into training, validation, and test sets and preprocessing them; S102, constructing a deep learning network based on local neighborhood attention, using the training set to train and the validation set to obtain the optimal model; S103, inputting the test set into the optimal model to obtain the semantic segmentation result.
[0004] However, the following shortcomings remain in practical applications. First, most methods adopt fixed-scale neighborhoods and simple feature aggregation strategies, which struggle to fully characterize the geometric structural differences and semantic relationships between points within a neighborhood, resulting in insufficient expressive ability for multi-scale targets in complex scenes. Second, although global modeling methods that introduce self-attention mechanisms can capture long-range dependencies, their computational complexity grows with the square of the number of points, resulting in large computational overhead in large-scale point cloud scenarios. Third, existing methods mostly adopt static fusion such as feature concatenation or element-wise addition when fusing local and global features, and cannot dynamically adjust the fusion ratio according to semantic context, which easily introduces redundant information or weakens local geometric details.
Summary of the Invention
[0005] The technical problem to be solved by this invention is to provide a point cloud segmentation method based on multi-scale local feature aggregation and global feature refinement. Through multi-scale local feature enhancement, efficient global feature modeling, and an adaptive feature fusion strategy, the method improves the accuracy, robustness, and generalization ability of 3D point cloud semantic segmentation while maintaining computational efficiency.
[0006] To achieve the above objectives, the technical solution adopted by this invention is: a point cloud segmentation method with multi-scale local feature aggregation and global feature refinement, comprising the following steps:
[0007] Step 1: Acquire and preprocess point cloud data to obtain a point cloud training dataset;
[0008] Step 2: Construct a multi-scale local neighborhood structure for the point cloud, extract multi-scale local features, and form a point-level feature representation;
[0009] Step 3: Select representative key points from the point cloud data for modeling, and perform long-range dependency modeling through the global feature refinement module to enhance global information;
[0010] Step 4: Use the adaptive feature fusion module to dynamically weight and fuse local features and global refined features based on the learned fusion weights;
[0011] Step 5: Through upsampling and feature propagation, predict semantic labels point by point and output the final point cloud semantic segmentation result.
[0012] In step 1, the input 3D point cloud data is preprocessed, and a multi-scale local neighborhood structure of the point cloud is constructed.
[0013] In step 2, local feature enhancement is achieved by jointly encoding the relative spatial location information and feature differences between the center point and neighboring points in the neighborhood at different scales. Then, the neighborhood features are aggregated by a joint pooling strategy to obtain a multi-scale local feature representation.
[0014] In step 3, based on multi-scale local features, feature refinement operation is introduced, and representative key points are selected from the point cloud through a learnable key point selection mechanism to construct a sparse global attention structure based on key points, so as to capture long-range dependencies in the point cloud and reduce the computational complexity of global modeling.
[0015] In step 4, the adaptive feature fusion module dynamically weights and fuses multi-scale local features and global features according to learnable fusion weights to generate fused features that combine local geometric details and global semantic consistency.
[0016] In step 5, the semantic category prediction results of each point in the point cloud are output based on the fusion features to achieve semantic segmentation of the 3D point cloud.
[0017] In step 1, an original point cloud P containing N points is obtained. Each point of the original point cloud P carries initial features composed of three-dimensional coordinates and color information.
[0018] Augmentation operations are performed on the original point cloud P data during the training phase;
[0019] The data augmentation operations include random rotation, scaling, and translation.
[0020] The first layer of the encoder takes the entire original point cloud P as input. Each subsequent layer of the encoder uses the farthest point sampling algorithm to select a set of center points from the output of the previous layer as anchor points for feature extraction in this layer.
[0021] In step 2, for each center point selected by the encoder, multiple local neighborhoods of different scales are constructed around it. A set of progressively increasing search radii is set, and the K nearest neighbor algorithm is used to find K neighboring points within each radius, so that four neighborhood sets with different spatial receptive fields are formed for the center point.
[0022] In step 2, feature extraction at each scale is completed by two sub-modules: local feature enhancement and joint pooling. For each point in the neighborhood, its relative coordinate difference and feature difference with respect to the center point are calculated, expressed as:
[0023] ;
[0024] Two independent multilayer perceptrons map the relative coordinate difference and the feature difference, respectively, to a high-dimensional space, and the results are summed to yield an encoding that integrates spatial and semantic differences, expressed as:
[0025] ;
[0026] The encoding is concatenated with the original features of the neighboring point, then integrated and reduced in dimensionality by an MLP to output the enhanced neighborhood features, expressed as:
[0027] ;
[0028] Two pooling operations are performed in parallel on the enhanced feature set;
[0029] The steps of the two pooling operations include:
[0030] First, max pooling is used to extract the most salient features:
[0031] ;
[0032] Next, for attention pooling, a lightweight MLP and a softmax function are applied to each enhanced feature to calculate attention weights, expressed as:
[0033] ;
[0034] Then, attention-weighted summation is performed to obtain the attention pooling features, expressed as:
[0035] ;
[0036] Finally, a learnable scalar parameter λ is introduced and normalized to a fusion weight using the Sigmoid function, and the two pooling results are dynamically fused to obtain the final features of the center point at this scale:
[0037] ;
[0038] After the above operations are performed on the four scale neighborhoods respectively, the four resulting features are concatenated, then fused and channel-adjusted by a final MLP to output the multi-scale local aggregated features of this module, expressed as:
[0039] .
[0040] Step 3 includes the following steps:
[0041] 1) Select a representative set of key points from all the points;
[0042] 2) After the key points and their features are obtained, sparse global attention calculation is performed.
[0043] Step 1) includes:
[0044] First, an additional local feature enhancement operation is applied to the local features to aggregate broader neighborhood information and obtain refined features, expressed as:
[0045] ;
[0046] where LFA stands for the local feature enhancement operation;
[0047] Then, a learnable keypoint selection mechanism is adopted: the refined features are concatenated with a geometric encoding of the point cloud coordinates and mapped to features Z by a lightweight MLP, expressed as:
[0048] ;
[0049] Next, a similarity score S is calculated from Z and a learnable matrix W, and the attention weight matrix A is obtained through Softmax normalization, expressed as:
[0050] ;
[0051] Finally, a weighted sum of the coordinates of all points P is computed with the weight matrix A to obtain the coordinates of the M key points:
[0052] .
[0053] In step 2), the steps for calculating sparse global attention include:
[0054] First, the features of all points serve as the query Q, and the key point features serve as the key K and value V;
[0055] Then, after linear projection, sparse attention is performed between the query points and the key points, expressed as:
[0056] $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$;
[0057] where d is the attention dimension. The global refined features are then obtained through a residual connection and an MLP:
[0058] .
[0059] In step 4, a learnable scalar parameter λ is introduced, which is mapped to a normalized fusion weight α using the Sigmoid function, expressed as:
[0060] $\alpha = \sigma(\lambda)$;
[0061] where σ represents the Sigmoid function;
[0062] The final output features of this encoder layer are calculated as a weighted sum, expressed as:
[0063] .
[0064] In step 5, upsampling is performed using methods such as nearest neighbor interpolation to gradually restore the number of points. Each decoder level concatenates and fuses the upsampled features with the features passed from the corresponding encoder layer through skip connections. After all upsampling layers, the features are restored to the original number of points. A prediction head composed of two MLPs maps the high-dimensional features to the semantic category space and outputs the category prediction probability of each point.
[0065] A 3D point cloud semantic segmentation architecture, including an encoder and a decoder;
[0066] The encoder consists of four stacked layers, each of which contains a multi-scale local feature aggregation module and a global feature refinement module.
[0067] The decoder consists of a recovery module, a fusion module, and a prediction head. The recovery module gradually restores the point cloud resolution through upsampling operations. The fusion module performs fusion by implementing skip connections with features from the corresponding level of the encoder. The prediction head consists of a multilayer perceptron and outputs semantic segmentation results.
[0068] Compared with the prior art, the present invention has the following beneficial effects:
[0069] 1. This invention effectively enhances the ability to express geometric structures and semantic information at different scales in point clouds by constructing multi-scale local neighborhoods and performing local feature enhancement, thereby enhancing the ability to identify multi-scale targets in complex scenes.
[0070] 2. This invention introduces a sparse global attention mechanism based on key points, which significantly reduces computational complexity while retaining global modeling capabilities, making it suitable for large-scale point cloud data processing;
[0071] 3. This invention achieves a dynamic balance between local and global features through an adaptive feature fusion strategy, effectively avoiding the problems of redundant information or loss of local details caused by static fusion methods.
[0072] 4. The overall structure of this invention has good robustness and generalization ability, and can achieve high-precision semantic segmentation in complex 3D scenes.
Attached Figure Description
[0073] The following is a brief explanation of the content represented by each figure in this specification:
[0074] Figure 1 is a schematic diagram of the MGR-Net point cloud semantic segmentation network structure proposed in this invention;
[0075] Figure 2 is a flowchart of the point cloud semantic segmentation method proposed in this invention;
[0076] Figure 3 is a schematic diagram of the multi-scale local feature aggregation module structure proposed in this invention;
[0077] Figure 4 is a schematic diagram of the local feature enhancement and joint pooling module structure;
[0078] Figure 5 is a schematic diagram of the global refinement attention module proposed in this invention;
[0079] Figure 6 is a schematic diagram of the local and global feature fusion strategy proposed in this invention;
[0080] Figure 7 shows the visualization results of the method of the present invention on the S3DIS dataset;
[0081] Figure 8 shows the visualization results of the method of the present invention on the ScanNetV2 dataset.
Detailed Implementation
[0082] The following description, with reference to the accompanying drawings, details the specific implementation of the present invention, including the shape and structure of each component, the relative positions and connections between the parts, the function and working principle of each part, the manufacturing process, and the operation and use methods, to help those skilled in the art to have a more complete, accurate, and in-depth understanding of the inventive concept and technical solution of the present invention.
[0083] This invention is a point cloud segmentation method based on multi-scale local feature aggregation and global feature refinement, which is applicable to large-scale point cloud semantic understanding in scenarios such as autonomous driving, robot environmental perception, and 3D reconstruction.
[0084] The overall approach involves first preprocessing the input 3D point cloud data and constructing a multi-scale local neighborhood structure for the point cloud. Within neighborhoods at different scales, local feature enhancement is achieved by jointly encoding the relative spatial position information and feature differences between the center point and neighboring points. A joint pooling strategy is then used to aggregate the neighborhood features, resulting in a multi-scale local feature representation. Next, based on the multi-scale local features, feature refinement is introduced, and a learnable keypoint selection mechanism is used to select representative keypoints from the point cloud, constructing a sparse global attention structure based on these keypoints to capture long-range dependencies and reduce the computational complexity of global modeling. Subsequently, an adaptive feature fusion module dynamically weights and fuses the multi-scale local and global features according to learnable fusion weights, generating fused features that combine local geometric details with global semantic consistency. Finally, based on the fused features, the semantic category prediction results for each point in the point cloud are output, achieving semantic segmentation of the 3D point cloud.
[0085] Figure 1 shows the overall architecture of the point cloud semantic segmentation network (MGR-Net). The network employs an encoder-decoder structure. The encoder consists of four stacked layers, each containing a multi-scale local feature aggregation module and a global feature refinement module, accompanied by downsampling of the point cloud and an increase in the number of feature channels. The decoder gradually restores the point cloud resolution through upsampling operations and fuses the result with the features from the corresponding encoder layers via skip connections. Finally, the semantic segmentation result is output through a prediction head composed of multilayer perceptrons.
[0086] As shown in Figure 2, the overall process of the point cloud semantic segmentation method includes the following steps:
[0087] Step 1: First, preprocess and sample the input point cloud data;
[0088] Step 2: Construct multi-scale local regions and extract aggregated features to form a refined local feature representation;
[0089] Step 3: Select representative key points to perform efficient global long-range dependency modeling to obtain global refined features;
[0090] Step 4: Dynamically combine local and global features through an adaptive fusion mechanism;
[0091] Step 5: Output the semantic label for each point through upsampling and decoding operations to complete the segmentation.
[0092] Data preprocessing and sampling are performed to obtain an initial point cloud P containing N points. Each point contains 3D coordinates (x, y, z) and color (R, G, B), forming the initial features. During the training phase, data augmentation operations such as random rotation, scaling, and translation can be performed on the point cloud to improve the model's robustness. The first layer of the network encoder takes all N points as input. Each subsequent layer first uses a farthest point sampling algorithm to select a set of center points from the output of the previous layer as anchor points for feature extraction in this layer.
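The farthest point sampling step mentioned above can be illustrated with a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the implementation used in this invention: the function name `farthest_point_sampling`, the `start_index` parameter, and the toy coordinates are hypothetical, and a practical implementation would run on the GPU.

```python
import math

def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedy farthest point sampling over a list of (x, y, z) tuples.

    Returns the indices of the selected center points. `start_index` is an
    assumption of this sketch; implementations often choose it at random.
    """
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be in 1..len(points)")
    selected = [start_index]
    # Distance from every point to its nearest already-selected center.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        # Pick the point that is farthest from all centers chosen so far.
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected

# Toy point cloud (hypothetical values).
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0),
          (0.0, 1.0, 0.0), (5.0, 5.0, 0.0)]
centers = farthest_point_sampling(points, 3)
```

Because each new center maximizes its distance to the centers already chosen, the selected anchors spread over the whole scene rather than clustering in dense regions.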
[0093] Constructing multi-scale local regions and extracting local features: as shown in Figure 3, for each center point selected in a given encoder layer, multiple local neighborhoods of different scales are constructed around it. Specifically, a set of gradually increasing search radii is set, and within each radius the K nearest neighbor algorithm is used to find K neighboring points, thus forming four neighborhood sets with different spatial receptive fields for the center point.
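The neighborhood construction described above can be sketched as follows. This is a brute-force illustration under assumptions: the radii values, `k`, and the function name `multi_scale_neighborhoods` are hypothetical, the center point is counted as its own nearest neighbor, and a neighborhood simply stays shorter than `k` when fewer points fall inside the radius (the text does not say how that case is handled).

```python
import math

def multi_scale_neighborhoods(points, center_index, radii, k):
    """For each radius, return indices of up to k nearest points inside it."""
    center = points[center_index]
    # Sort all points once by distance to the center (brute force).
    by_distance = sorted(range(len(points)),
                         key=lambda i: math.dist(points[i], center))
    neighborhoods = []
    for radius in radii:
        inside = [i for i in by_distance
                  if math.dist(points[i], center) <= radius]
        neighborhoods.append(inside[:k])  # keep the k nearest within the radius
    return neighborhoods

# Toy point cloud and four increasing radii (hypothetical values).
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.3, 0.0),
          (0.5, 0.0, 0.0), (0.0, 0.0, 0.9), (2.0, 0.0, 0.0)]
scales = multi_scale_neighborhoods(points, 0, radii=[0.2, 0.4, 0.8, 1.6], k=5)
```

Each larger radius admits points farther from the center, so the four index lists correspond to four nested receptive fields.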
[0094] Each single-scale neighborhood is processed by two sub-modules: local feature enhancement and joint pooling. The structure is shown in Figure 4. For each point in the neighborhood, its relative coordinate difference and feature difference with respect to the center point are calculated as follows:
[0095]
[0096] Two independent multilayer perceptrons (MLPs) map the relative coordinate difference and the feature difference, respectively, to a high-dimensional space, and the results are summed to yield an encoding that integrates spatial and semantic differences. The calculation is as follows:
[0097]
[0098] The encoding is then concatenated with the original features of the neighboring point, and integrated and reduced in dimensionality by an MLP to output the enhanced neighborhood features:
[0099]
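The local feature enhancement described above can be sketched for a single neighbor as follows. This is an illustration under assumptions rather than the formulas of this invention: a single dense layer with ReLU stands in for each MLP, differences are taken as neighbor minus center (the sign convention is not recoverable from the text), and all names and weight values are hypothetical.

```python
def linear_relu(x, weight, bias):
    """One dense layer followed by ReLU, standing in for an MLP."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weight, bias)]

def enhance_neighbor(center_xyz, center_feat, nbr_xyz, nbr_feat, params):
    # Relative differences (neighbor minus center; the sign is an assumption).
    delta_p = [n - c for n, c in zip(nbr_xyz, center_xyz)]
    delta_f = [n - c for n, c in zip(nbr_feat, center_feat)]
    # Two independent layers lift both differences to the same width.
    pos_code = linear_relu(delta_p, *params["pos"])
    feat_code = linear_relu(delta_f, *params["feat"])
    # Summing gives one encoding of spatial and semantic differences.
    encoding = [a + b for a, b in zip(pos_code, feat_code)]
    # Concatenate with the neighbor's original feature, then reduce dimension.
    return linear_relu(encoding + list(nbr_feat), *params["out"])

# Hand-picked toy weights (hypothetical; a trained network would learn them).
params = {
    "pos": ([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], [0, 0, 0, 0]),
    "feat": ([[1, 0], [0, 1], [1, 1], [1, -1]], [0, 0, 0, 0]),
    "out": ([[0.5] * 6, [1, 0, 0, 0, 0, 1]], [0, 0]),
}
enhanced = enhance_neighbor((0.0, 0.0, 0.0), [1.0, 2.0],
                            (1.0, 2.0, 0.0), [2.0, 1.0], params)
```

Applying `enhance_neighbor` to every point of a neighborhood yields the enhanced feature set that the pooling stage consumes.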
[0100] Two pooling operations are performed in parallel on the enhanced feature set. First, max pooling is used to extract the most salient features:
[0101]
[0102] Second, attention pooling is used: a lightweight MLP and a softmax function are first applied to each enhanced feature to calculate attention weights:
[0103]
[0104] Then, attention-weighted summation is performed to obtain the attention pooling features:
[0105]
[0106] Finally, a learnable scalar parameter λ is introduced and normalized to a fusion weight using the Sigmoid function, and the two pooling results are dynamically fused to obtain the final features of the center point at this scale:
[0107]
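The joint pooling described above can be sketched as follows. Several details are assumptions of this sketch and are not stated in the text: one scalar attention score is computed per neighbor (per-channel scores are equally plausible), a single linear scoring layer stands in for the lightweight MLP, and the Sigmoid weight is applied to the max-pooling branch with its complement on the attention branch.

```python
import math

def joint_pooling(features, score_weights, lam):
    """Fuse max pooling and attention pooling over K enhanced features.

    features: K lists of C floats. score_weights: C floats (hypothetical
    stand-in for the lightweight MLP). lam: the learnable scalar.
    """
    channels = range(len(features[0]))
    # Max pooling: channel-wise maximum over the neighborhood.
    f_max = [max(f[c] for f in features) for c in channels]
    # Attention pooling: one score per neighbor, softmax over the neighborhood.
    scores = [sum(w * v for w, v in zip(score_weights, f)) for f in features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    f_att = [sum(w * f[c] for w, f in zip(weights, features)) for c in channels]
    # The Sigmoid maps lam to a fusion weight in (0, 1).
    beta = 1.0 / (1.0 + math.exp(-lam))
    fused = [beta * m + (1.0 - beta) * a for m, a in zip(f_max, f_att)]
    return fused, weights

# Toy neighborhood of three enhanced features (hypothetical values).
fused, weights = joint_pooling([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                               score_weights=[0.0, 0.0], lam=0.0)
```

With zero scoring weights the attention branch reduces to average pooling, and `lam = 0.0` mixes the two branches equally; training would move both away from these neutral values.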
[0108] After the above operations are performed on the four scale neighborhoods respectively, the four resulting features are concatenated, then fused and channel-adjusted by a final MLP to output the multi-scale local aggregated features of this module:
[0109]
[0110] This feature forms a robust point-level feature representation.
[0111] Global feature refinement based on key points: as shown in Figure 5, a representative set of key points is first selected from all points. First, an additional local feature enhancement operation is applied to the local features to aggregate broader neighborhood information and obtain refined features:
[0112]
[0113] LFA stands for the local feature enhancement operation. Then, a learnable keypoint selection mechanism is employed: the refined features are concatenated with a geometric encoding of the point cloud coordinates and mapped to features Z by a lightweight MLP:
[0114]
[0115] Next, a similarity score S is calculated from Z and a learnable matrix W, and the attention weight matrix A is obtained through Softmax normalization:
[0116]
[0117] Finally, a weighted sum of the coordinates of all points P is computed with the weight matrix A to obtain the coordinates of the M key points:
[0118]
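The keypoint selection described above can be sketched as follows. The shapes and the normalization axis are assumptions of this sketch, since the original formulas are not reproduced in this text: Z is taken as N x C, W as C x M, the score of point n for keypoint m is the dot product of row n of Z with column m of W, and Softmax is applied over the N points so that every keypoint is a convex combination of point coordinates.

```python
import math

def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def select_keypoints(coords, z, w):
    """coords: N x 3, z: N x C, w: C x M (assumed shapes).

    Returns (M x 3 keypoint coordinates, M x N attention weights).
    """
    num_keypoints = len(w[0])
    keypoints, attention = [], []
    for m in range(num_keypoints):
        # Similarity of every point with the m-th learnable column of W.
        scores = [sum(z_i[c] * w[c][m] for c in range(len(w))) for z_i in z]
        # Normalize over the N points for this keypoint.
        a_m = softmax(scores)
        attention.append(a_m)
        # Weighted sum of all point coordinates gives the keypoint position.
        keypoints.append([sum(a * p[d] for a, p in zip(a_m, coords))
                          for d in range(3)])
    return keypoints, attention

# Toy inputs (hypothetical values): four points, two channels, two keypoints.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
w = [[4.0, 0.0], [0.0, 4.0]]
keypoints, attention = select_keypoints(coords, z, w)
```

Because the keypoints are soft weighted sums rather than hard selections, the operation stays differentiable and W can be trained end to end. Keypoint features could be aggregated with the same weights, which is again an assumption rather than something the text states.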
[0119] After the key points and their features are obtained, sparse global attention is calculated. The features of all points serve as the query Q, and the key point features serve as the keys K and values V. After linear projection, sparse attention is performed between the query points and the key points:
[0120] $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$
[0121] where d is the attention dimension. The global refined features are then obtained through a residual connection and an MLP:
[0122]
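The sparse global attention described above can be sketched as follows. This is a simplified illustration under assumptions: the linear projections that produce Q, K and V are taken as identity, the trailing MLP is omitted, and the function name and toy features are hypothetical. The point it shows is that N query points attend to only M key points, so N·M scores are computed instead of N·N.

```python
import math

def sparse_global_attention(point_feats, key_feats):
    """Each of the N points attends to M key points with scaled dot products."""
    d = len(point_feats[0])  # attention dimension
    refined = []
    for q in point_feats:
        # Scaled dot-product scores between one query and all key points.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in key_feats]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Softmax-weighted sum of the key point values.
        attended = [sum(e / total * v[c] for e, v in zip(exps, key_feats))
                    for c in range(d)]
        # Residual connection; the trailing MLP is omitted in this sketch.
        refined.append([x + y for x, y in zip(q, attended)])
    return refined

# Toy features (hypothetical values): two points, two key points.
refined = sparse_global_attention([[0.0, 0.0], [1.0, 0.0]],
                                  [[1.0, 0.0], [0.0, 1.0]])
```

With M much smaller than N, the cost grows linearly in the number of points for a fixed number of key points, which is the complexity reduction the text refers to.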
[0123] Adaptive fusion of local and global features: as shown in Figure 6, a learnable scalar parameter λ is introduced, which is mapped to a normalized fusion weight α using the Sigmoid function:
[0124] $\alpha = \sigma(\lambda)$
[0125] where σ represents the Sigmoid function. The final output features of this encoder layer are calculated as a weighted sum:
[0126]
[0127] This mechanism enables the network to adaptively adjust its dependence on local and global information based on different input scenarios and levels.
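The adaptive fusion described above can be sketched as follows. The text fixes α as the Sigmoid of λ; which branch receives α and which receives 1 − α is an assumption of this sketch, as are the function name and the toy features.

```python
import math

def adaptive_fusion(local_feats, global_feats, lam):
    """Blend per-point local and global features with one learnable scalar."""
    alpha = 1.0 / (1.0 + math.exp(-lam))  # Sigmoid keeps alpha in (0, 1)
    # Assumed assignment: alpha on the local branch, 1 - alpha on the global one.
    return [[alpha * l + (1.0 - alpha) * g for l, g in zip(l_row, g_row)]
            for l_row, g_row in zip(local_feats, global_feats)]

# Toy features for two points (hypothetical values).
local_feats = [[1.0, 0.0], [0.0, 2.0]]
global_feats = [[0.0, 1.0], [2.0, 0.0]]
fused = adaptive_fusion(local_feats, global_feats, lam=0.0)
```

Since each encoder layer holds its own λ, different layers can settle on different balances between local detail and global context during training.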
[0128] After multiple encoder layers, the point cloud resolution decreases while the semantic content of the features is enhanced. In the decoder, upsampling is performed using methods such as nearest neighbor interpolation to gradually restore the number of points. Each decoder layer concatenates and fuses the upsampled features with features from the corresponding encoder layer via skip connections to recover geometric details that may have been lost during encoding. Finally, after all upsampling layers, the features are restored to the original number of points. A prediction head consisting of two MLPs maps the high-dimensional features to the semantic category space, outputting the category prediction probability for each point, thus obtaining the complete point cloud semantic segmentation result.
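One decoder level as described above can be sketched as follows, assuming plain nearest neighbor interpolation and concatenation for the skip connection. The MLP that fuses the concatenated features is omitted, and the function names and toy values are hypothetical.

```python
import math

def upsample_nearest(coarse_xyz, coarse_feats, fine_xyz):
    """Copy to every fine point the feature of its nearest coarse point."""
    upsampled = []
    for p in fine_xyz:
        nearest = min(range(len(coarse_xyz)),
                      key=lambda i: math.dist(p, coarse_xyz[i]))
        upsampled.append(list(coarse_feats[nearest]))
    return upsampled

def decoder_level(coarse_xyz, coarse_feats, fine_xyz, skip_feats):
    up = upsample_nearest(coarse_xyz, coarse_feats, fine_xyz)
    # Skip connection: concatenate with encoder features of the same level.
    return [u + list(s) for u, s in zip(up, skip_feats)]

# Toy level (hypothetical values): two coarse points restored to three points.
decoded = decoder_level(
    coarse_xyz=[(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    coarse_feats=[[1.0], [2.0]],
    fine_xyz=[(1.0, 0.0, 0.0), (9.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    skip_feats=[[0.1], [0.2], [0.3]],
)
```

Repeating this level once per encoder layer brings the feature set back to the original N points before the prediction head is applied.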
[0129] To verify the effectiveness of this invention, comprehensive experiments were conducted on two large-scale indoor point cloud datasets, S3DIS and ScanNetV2. All experiments were performed on a server running Ubuntu 20.04, with hardware including an NVIDIA GeForce RTX 4090 (24GB) GPU and an Intel Core i9-13900KF CPU. The software environment was implemented using the PyTorch deep learning framework. During training, a label-smoothed cross-entropy loss function and the AdamW optimizer were used, with an initial learning rate set to 0.001, dynamically adjusted using a cosine annealing strategy. Batch size and training epochs were set in accordance with existing mainstream methods to ensure fair comparison.
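The training configuration described above can be illustrated with a small sketch of a label-smoothed cross-entropy loss and a cosine annealing schedule. The initial learning rate of 0.001 comes from the text; the smoothing factor 0.1, the minimum learning rate 0.0, and the function names are assumptions of this sketch, and the experiments themselves used PyTorch rather than plain Python.

```python
import math

def label_smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross entropy against a smoothed one-hot target for a single point."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_norm for z in logits]
    n = len(logits)
    loss = 0.0
    for c, lp in enumerate(log_probs):
        # smoothing / n on every class, plus 1 - smoothing on the true label.
        q = smoothing / n + (1.0 - smoothing if c == target else 0.0)
        loss -= q * lp
    return loss

def cosine_annealing_lr(step, total_steps, base_lr=0.001, min_lr=0.0):
    """Learning rate decayed from base_lr to min_lr along half a cosine."""
    cosine = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

# Toy usage (hypothetical values): four classes, 100 training steps.
loss = label_smoothed_cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
schedule = [cosine_annealing_lr(s, 100) for s in (0, 50, 100)]
```

Label smoothing keeps the target distribution from becoming fully one-hot, which discourages over-confident per-point predictions.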
[0130] The segmentation performance of MGR-Net on Area 5 of the S3DIS dataset is shown in Figure 7, where the mean intersection over union (mIoU) reached 71.4%. The segmentation performance of MGR-Net on the ScanNetV2 dataset is shown in Figure 8, where the mIoU reached 71.5%, which is better than current mainstream methods, demonstrating the effectiveness and superiority of the modules and network structure proposed in this invention in practical applications.
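The mean intersection over union metric reported above can be computed as in the following sketch. Skipping classes that appear in neither the prediction nor the ground truth is an assumption of this sketch; benchmark protocols differ on this point, and the labels below are toy values rather than results from the experiments.

```python
def mean_iou(pred, truth, num_classes):
    """Mean of per-class intersection over union for per-point labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy labels for five points and three classes (hypothetical values).
score = mean_iou(pred=[0, 0, 1, 1, 2], truth=[0, 1, 1, 1, 2], num_classes=3)
```

Here the per-class values are 1/2, 2/3 and 1, so their mean is 13/18, or about 72.2%.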
[0131] The present invention has been described above by way of example with reference to the accompanying drawings. Obviously, the specific implementation of the present invention is not limited to the above-described manner. Any non-substantial improvements made using the inventive concept and technical solution of the present invention, or the direct application of the inventive concept and technical solution of the present invention to other occasions without modification, are all within the protection scope of the present invention.
Claims
1. A point cloud segmentation method based on multi-scale local feature aggregation and global feature refinement, characterized in that it comprises the following steps: Step 1: Acquire and preprocess point cloud data to obtain a point cloud training dataset; Step 2: Construct a multi-scale local neighborhood structure for the point cloud, extract multi-scale local features, and form a point-level feature representation; Step 3: Select representative key points from the point cloud data for modeling, and perform long-range dependency modeling through the global feature refinement module to enhance global information; Step 4: Use the adaptive feature fusion module to dynamically weight and fuse local features and global refined features based on the learned fusion weights; Step 5: Through upsampling and feature propagation, predict semantic labels point by point and output the final point cloud semantic segmentation result.
2. The point cloud segmentation method for multi-scale local feature aggregation and global feature refinement according to claim 1, characterized in that: In step 1, the input 3D point cloud data is preprocessed, and a multi-scale local neighborhood structure of the point cloud is constructed. In step 2, local feature enhancement is achieved by jointly encoding the relative spatial location information and feature differences between the center point and neighboring points in the neighborhood at different scales. Then, the neighborhood features are aggregated by a joint pooling strategy to obtain a multi-scale local feature representation. In step 3, based on multi-scale local features, feature refinement operation is introduced, and representative key points are selected from the point cloud through a learnable key point selection mechanism to construct a sparse global attention structure based on key points, so as to capture long-range dependencies in the point cloud and reduce the computational complexity of global modeling. In step 4, the adaptive feature fusion module dynamically weights and fuses multi-scale local features and global features according to learnable fusion weights to generate fused features that combine local geometric details and global semantic consistency. In step 5, the semantic category prediction results of each point in the point cloud are output based on the fusion features to achieve semantic segmentation of the 3D point cloud.
3. The point cloud segmentation method for multi-scale local feature aggregation and global feature refinement according to claim 1 or 2, characterized in that: In step 1, an original point cloud P containing N points is obtained. Each point of the original point cloud P carries initial features composed of three-dimensional coordinates and color information. Augmentation operations are performed on the original point cloud P data during the training phase; The data augmentation operations include random rotation, scaling, and translation. The first layer of the encoder takes the entire original point cloud P as input. Each subsequent layer of the encoder uses the farthest point sampling algorithm to select a set of center points from the output of the previous layer as anchor points for feature extraction in this layer.
4. The point cloud segmentation method for multi-scale local feature aggregation and global feature refinement according to claim 3, characterized in that: In step 2, for each center point selected by the encoder, multiple local neighborhoods of different scales are constructed around it. A set of progressively increasing search radii is set, and the K nearest neighbor algorithm is used to find K neighboring points within each radius, so that four neighborhood sets with different spatial receptive fields are formed for the center point. In step 2, feature extraction at each scale is completed by two sub-modules: local feature enhancement and joint pooling. For each point in the neighborhood, its relative coordinate difference and feature difference with respect to the center point are calculated, expressed as: ; Two independent multilayer perceptrons map the relative coordinate difference and the feature difference, respectively, to a high-dimensional space, and the results are summed to yield an encoding that integrates spatial and semantic differences, expressed as: ; The encoding is concatenated with the original features of the neighboring point, then integrated and reduced in dimensionality by an MLP to output the enhanced neighborhood features, expressed as: ; Two pooling operations are performed in parallel on the enhanced feature set; The steps of the two pooling operations include: First, max pooling is used to extract the most salient features: ; Next, for attention pooling, a lightweight MLP and a softmax function are applied to each enhanced feature to calculate attention weights, expressed as: ; Then, attention-weighted summation is performed to obtain the attention pooling features, expressed as: ; Finally, a learnable scalar parameter λ is introduced and normalized to a fusion weight using the Sigmoid function, and the two pooling results are dynamically fused to obtain the final features of the center point at this scale: ; After the above operations are performed on the four scale neighborhoods respectively, the four resulting features are concatenated, then fused and channel-adjusted by a final MLP to output the multi-scale local aggregated features of this module, expressed as: .
5. The point cloud segmentation method for multi-scale local feature aggregation and global feature refinement according to claim 4, characterized in that: Step 3 includes the following steps: 1) Select a representative set of key points from all the points; 2) After the key points and their features are obtained, sparse global attention calculation is performed.
6. The point cloud segmentation method for multi-scale local feature aggregation and global feature refinement according to claim 5, characterized in that: Step 1) includes: First, an additional local feature enhancement operation is applied to the local features to aggregate broader neighborhood information and obtain refined features, expressed as: ; where LFA stands for the local feature enhancement operation; Then, a learnable keypoint selection mechanism is adopted: the refined features are concatenated with a geometric encoding of the point cloud coordinates and mapped to features Z by a lightweight MLP, expressed as: ; Next, a similarity score S is calculated from Z and a learnable matrix W, and the attention weight matrix A is obtained through Softmax normalization, expressed as: ; Finally, a weighted sum of the coordinates of all points P is computed with the weight matrix A to obtain the coordinates of the M key points: .
7. The point cloud segmentation method for multi-scale local feature aggregation and global feature refinement according to claim 6, characterized in that: In step 2), the steps for calculating sparse global attention include: First, the features of all points serve as the query Q, and the key point features serve as the key K and value V; Then, after linear projection, sparse attention is performed between the query points and the key points, expressed as: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$; where d is the attention dimension. The global refined features are then obtained through a residual connection and an MLP: .
8. The point cloud segmentation method for multi-scale local feature aggregation and global feature refinement according to claim 1, 2, or 7, characterized in that: In step 4, a learnable scalar parameter λ is introduced, which is mapped to a normalized fusion weight α using the Sigmoid function, expressed as: $\alpha = \sigma(\lambda)$; where σ represents the Sigmoid function; The final output features of this encoder layer are calculated as a weighted sum, expressed as: .
9. The point cloud segmentation method for multi-scale local feature aggregation and global feature refinement according to claim 8, characterized in that: In step 5, upsampling is performed using methods such as nearest neighbor interpolation to gradually restore the number of points. Each decoder level concatenates and fuses the upsampled features with the features passed from the corresponding encoder layer through skip connections. After all upsampling layers, the features are restored to the original number of points. A prediction head composed of two MLPs maps the high-dimensional features to the semantic category space and outputs the category prediction probability of each point.
10. A three-dimensional point cloud semantic segmentation architecture, characterized in that it includes an encoder and a decoder; The encoder consists of four stacked layers, each of which contains a multi-scale local feature aggregation module and a global feature refinement module. The decoder consists of a recovery module, a fusion module, and a prediction head. The recovery module gradually restores the point cloud resolution through upsampling operations. The fusion module fuses the upsampled features with features from the corresponding encoder level through skip connections. The prediction head consists of a multilayer perceptron and outputs semantic segmentation results.