Battery swapping robot target point cloud segmentation method based on multi-scale attention aggregation

By using a binocular structured light depth camera and multi-scale attention aggregation technology, the accuracy and efficiency issues of target fastener point cloud segmentation during electric vehicle battery swapping were solved, achieving efficient target fastener segmentation and extraction, and improving battery swapping efficiency.

CN120451544B · Active · Publication Date: 2026-06-23 · SOUTHEAST UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SOUTHEAST UNIV
Filing Date
2025-04-27
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

In the battery swapping process of new energy electric vehicles, how to quickly and accurately segment and extract the point cloud of the target fastener to improve the battery swapping efficiency, especially in effectively removing background noise interference from the large-scale point cloud data acquired by depth sensors.

Method used

A binocular structured light depth camera was used to acquire RGB images, depth images, and point cloud data of the target fastener, and the FastSeg3D dataset was constructed. Noise points were removed by radius filtering and DBSCAN clustering. Combined with a multi-scale attention aggregation module, local feature aggregation and random sampling were used to reduce computational complexity and achieve high-precision segmentation.

Benefits of technology

It significantly improves the segmentation accuracy and computational efficiency of target fasteners, increases the operating speed and precision of battery swapping robots, reduces computational complexity, and enables the processing of large-scale point cloud data.



Abstract

The application discloses a target point cloud segmentation method for battery replacing robots based on multi-scale attention aggregation, and has the following steps: 1, using a binocular structured light depth camera to collect the RGB image, depth image and three-dimensional point cloud data of a target fastener, and fusing to generate a FastSeg3D dataset; 2, a two-stage preprocessing method is proposed, in which radius filtering is used to remove noise points, and DBSCAN density clustering is used to remove background point clusters far away from the target; 3, a local feature aggregation module is used in a network encoder to extract geometric features, and a random sampling strategy is used to reduce the calculation complexity; 4, a multi-scale attention aggregation module is embedded in the skip connection of the encoder and the decoder, channel and spatial attention units are used to fuse features, and adaptive weight weighting is realized on the features of each layer of the encoder; and 5, nearest neighbor interpolation up-sampling is adopted to restore the original point cloud resolution, and a high-quality point cloud target is obtained by outputting segmentation semantic labels. The application reduces the operation time of the battery replacing robot, and solves the balance problem between the segmentation speed and the accuracy of large-scale target point clouds.

Description

Technical Field

[0001] This invention belongs to the field of intelligent battery swapping for electric vehicles, and specifically discloses a target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation.

Background Technology

[0002] In the battery swapping process of new energy electric vehicles, how to further accelerate battery replacement speed and reduce operation time is a key research issue in the field of automated battery swapping technology. Target fasteners are crucial mechanical components connecting the battery pack to the vehicle body; therefore, the rapid disassembly and installation of target fasteners directly affects the efficiency of the battery swapping robot in accurately disassembling the battery panels. In the target fastener segmentation and extraction task, LiDAR or depth cameras are typically used as sensors. While LiDAR performs excellently in large-scale point cloud acquisition, its accuracy is limited and it is easily affected by noise when detecting small targets. In the battery swapping station scenario, the target fasteners are small in size and located at the edge or bottom of the vehicle's battery panels. Therefore, depth sensors are the optimal choice for processing small targets due to their high resolution and close-range accuracy. However, because the target point cloud acquired by high-precision depth cameras is large in scale, and there is a significant difference in point cloud between the battery panel and the target fastener, background noise in the point cloud leads to uneven point cloud density distribution in the target area, significantly reducing the robustness of the segmentation model. Therefore, how to effectively remove background point clouds and perform fast and accurate target segmentation is an important issue for improving the accuracy and efficiency of battery swapping.

[0003] Technical comparison with patent CN119671908A "A dynamic point cloud classification and prediction label correction method for an inspection robot";

[0004] Patent CN119671908A proposes a dynamic point cloud segmentation and predictive label correction method for inspection robots, primarily applied to train undercarriage inspection robot scenarios. It aims to address the interference of dynamic objects (such as personnel) on the positioning accuracy of 3D laser SLAM systems in shared human-machine maintenance trench environments. This method improves the positioning accuracy and operational efficiency of inspection robots in narrow, highly repetitive environments by identifying and separating dynamic point clouds containing moving personnel, reducing the need for frequent personnel entry and exit from maintenance trenches. This patent proposes a target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation, primarily applied to target fastener identification and segmentation scenarios in electric vehicle battery swapping environments. By segmenting fasteners and background point clouds with high precision, it provides support for subsequent pose estimation and accurate disassembly and assembly, thereby improving the efficiency of electric vehicle battery swapping. This method focuses on processing small-volume, geometrically complex target fasteners, leveraging the high-resolution advantage of depth cameras to address problems such as massive point cloud data and background noise interference. The application scenarios of these two methods are fundamentally different.

[0005] Patent CN119671908A uses solid-state LiDAR to collect point cloud data, clips the point cloud within the inspection channel using a pass-through filter, and projects the 3D point cloud onto a polar coordinate system to form a bird's-eye view, thus balancing the spatial distribution non-uniformity of the point cloud. A transformation matrix is determined using laser odometry, and a sliding time window is constructed to integrate data from the current frame and previous / next frames. Dynamic object motion features are extracted by examining height differences in the bird's-eye view. Dynamic object recognition is performed based on the dual-branch structure and appearance-motion co-attention mechanism of PolarNet. Finally, a combination of density-based spatial clustering (DBSCAN) and cloth simulation filtering (CSF) is used to correct the predicted labels, improving the accuracy of dynamic point cloud segmentation. By contrast, the present patent uses a binocular structured light depth camera to acquire RGB images, depth maps, and point cloud data of the target fastener, constructing the FastSeg3D dataset. It employs a radius filtering algorithm to remove outliers and local noise, and combines DBSCAN density clustering to remove distant background point clusters. An encoder-decoder architecture is constructed, extracting point cloud geometric features through a local feature aggregation module and reducing computational complexity through a random sampling strategy. A multi-scale attention aggregation module is introduced, fusing features of different resolutions through channel attention units and spatial attention units to achieve adaptive weighting. The multi-scale attention module is embedded in the skip connections between the encoder and decoder, and point cloud density is restored through nearest neighbor interpolation upsampling. Finally, the network outputs semantic labels to separate the target fastener from the background point cloud. The two technical solutions are fundamentally different.

[0006] Patent CN119671908A addresses dynamic point cloud segmentation. First, a pass-through filter is used to limit the point cloud processing range, projecting the 3D point cloud onto a bird's-eye view in a polar coordinate system. A sliding window mechanism is constructed, calculating the height differences of grid cells within consecutive time windows (W1 and W2) to obtain motion features, which are represented as residuals. An appearance-motion co-attention module is employed, including a co-attention gate and a motion-guided attention module based on motion cues, enhancing cross-modal interaction between appearance and motion features. Finally, in the label correction stage, the DBSCAN clustering algorithm and cloth simulation filtering are used to correct mislabels, improving segmentation accuracy by identifying high-density regions and removing isolated noise. This patent focuses on target point cloud segmentation in static scenes. First, radius filtering is used to remove isolated noise points and outliers. Then, DBSCAN density clustering is applied to identify and remove high-density background point clusters far from the target. In terms of network structure, an encoder-decoder architecture is built based on RandLA-Net. The encoder extracts point cloud geometric features through a local feature aggregation module and uses a random sampling strategy to reduce computational complexity. An innovative multi-scale attention aggregation module is introduced, fusing features of different resolutions through channel attention (using global average pooling, max pooling, and shared fully connected layers) and spatial attention (using multi-scale convolutional kernels) to achieve detail enhancement. The decoder restores point cloud density through nearest neighbor interpolation upsampling and reconstructs target details using multi-scale features. There is a fundamental difference in the technical methods used in these two approaches.

[0007] Patent CN119671908A achieves significant results in dynamic point cloud segmentation and label correction. In a simulated inspection ditch environment, the algorithm's mIoU (mean intersection over union) reaches 89.21% and its Acc (accuracy) reaches 95.62%, a significant improvement over PointNet and PointNet++. This method can effectively identify and segment dynamic targets such as personnel, minimize the interference of dynamic point clouds on SLAM, and enable the inspection robot to maintain stable positioning accuracy in environments where personnel coexist. The label correction technology correctly handles misidentification, improving the robustness of the system. The feature extraction method based on 2D BEV and the label correction strategy based on spatial clustering improve the system's processing efficiency and accuracy for low-cost solid-state LiDAR data. By contrast, the present patent demonstrates superior performance in fine segmentation of target fasteners. Through a two-stage preprocessing approach using radius filtering and DBSCAN density clustering, it significantly improves the signal-to-noise ratio of the target point cloud, removing over 80% of background noise points. The multi-scale attention aggregation module significantly enhances the model's segmentation accuracy for complex geometric structures such as fastener edges and threads, improving edge accuracy compared to benchmark models. Random sampling and local feature aggregation strategies reduce the computational complexity of point cloud processing from O(n log n) to O(n), enabling the processing of large-scale point clouds (>1 million points) on ordinary computing hardware, with a processing speed improvement of 5-8 times. Facing challenging scenarios such as changing viewpoints, varying lighting conditions, and partial occlusion, the model maintained stable segmentation performance, providing high-quality target point clouds for battery swapping robots and significantly improving the accuracy and speed of subsequent registration and pose estimation. The two methods differ fundamentally in their technical effects.

Summary of the Invention

[0008] To address the aforementioned technical issues, this invention proposes a target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation. This method can shorten the overall replacement time of battery swapping robots and overcomes the limitation that large-scale target point clouds cannot be segmented directly, quickly, and accurately.

[0009] To achieve the above objectives, the technical solution adopted by the present invention is as follows:

[0010] A target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation includes the following steps:

[0011] (1) Use a binocular structured light depth camera to capture images, obtain RGB images, depth images and three-dimensional point cloud data of the target fastener, and construct a target fastener point cloud dataset FastSeg3D containing multi-scene annotations, denoted as point cloud P0.

[0012] (2) Preprocess the original point cloud P0 data, specifically including: using the radius filtering algorithm to filter the point cloud P0, removing outliers and local noise, to obtain the preprocessed point cloud P1; using the density-based DBSCAN clustering algorithm to cluster the point cloud P1, removing irrelevant point clusters in the background of the distant battery panel, to obtain the point cloud P2.

[0013] (3) Obtain the neighborhood feature representation F1 of each point in the point cloud through the local spatial coding unit, and use the attention pooling mechanism to filter and aggregate the features to obtain the attention scores. The attention scores are multiplied element-wise with the neighborhood features F1 to obtain the weighted aggregated features. The features obtained by the local feature aggregation module are then randomly sampled to obtain point cloud P3, reducing the computational load. The encoder is implemented by stacking multiple sets of local spatial coding and attention pooling units.

[0014] (4) The multi-scale attention aggregation module efficiently captures long-range dependencies through channel and spatial attention mechanisms. This module receives feature tensors from high-resolution, medium-resolution, and low-resolution encoders and concatenates them to obtain feature F2, which is then input into the channel attention unit and the spatial attention unit, respectively. The channel attention unit generates channel description vectors through global average pooling and dynamically adjusts the channel weights by combining fully connected layers and activation functions to obtain the channel weight vector $W_C$. The spatial attention unit uses parallel 3×3, 5×5, and 7×7 convolutional kernels to extract local features from different receptive fields, resulting in a spatial weight vector $W_S$. The multi-scale attention aggregation module is embedded into the skip connection between the encoder and the decoder. The output features of each layer of the encoder are concatenated and then input into the multi-scale attention aggregation module for processing, and then concatenated layer by layer with the upsampled features of the decoder.

[0015] (5) The decoder takes the feature map after downsampling and feature extraction as input, and uses nearest neighbor interpolation upsampling to restore the features to the resolution of the original point cloud layer by layer to obtain point cloud P4;

[0016] (6) The semantic labels output by the segmentation network are indexed according to the coordinate dimension and feature dimension of the point cloud P4. The point cloud regions corresponding to different label values are selected. Then, based on the selection results, the original point cloud is split into two independent point cloud subsets, representing the battery panel and the target fastener, respectively.

[0017] As a further improvement of the present invention, the method for setting up the binocular structured light depth camera in step (1) is as follows:

[0018] (1-1) Mount the camera on the end of the battery swapping robot and move it along the track to a position 40-60cm below the target battery pack to collect data;

[0019] (1-2) Set the camera depth range to 50-3000mm, the depth map resolution to 960×600, and the color map resolution to 1920×1080. Ensure the coordinate system transformation accuracy through calibration tests.

[0020] (1-3) Data collection covers locked and unlocked states, different lighting conditions, shooting distance, tilt angle and clutter interference scenarios to ensure the generalizability of the dataset.

[0021] As a further improvement of the present invention, the preprocessing in step (2) includes:

[0022] (2-1) Radius filtering: Set the search radius r = 1.0 mm and the minimum number of neighbor points n = 10, remove noise points with neighborhood density below the threshold, and obtain point cloud P1;

[0023] (2-2) DBSCAN density clustering: Based on the k-nearest neighbor distance distribution curve, the optimal neighborhood radius ε and the minimum number of points minPts are determined, low-density background point clusters are identified and removed, the signal-to-noise ratio of the target point cloud is improved, and the point cloud P2 is obtained.
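The radius filtering of step (2-1) can be illustrated with a minimal sketch. This is a brute-force, standard-library illustration and not the patent's implementation: the function name `radius_filter`, the toy coordinates, the toy threshold, and the choice not to count a point as its own neighbor are all assumptions made here.

```python
import math

def radius_filter(points, r, k_min):
    """Keep points that have at least k_min other points within distance r."""
    kept = []
    for i, p in enumerate(points):
        # N_i: number of neighbours of p within radius r (assumption: p itself is excluded)
        n_i = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= r
        )
        if n_i >= k_min:
            kept.append(p)
    return kept

# Toy cloud (hypothetical values): a tight cluster of four points plus one outlier.
cloud = [
    (0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0),
    (10.0, 10.0, 10.0),
]
filtered = radius_filter(cloud, r=1.0, k_min=2)
```

The double loop costs O(n²) distance evaluations, so a point cloud of the size discussed in the text would need a spatial index such as a k-d tree or voxel grid for the neighbor search.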

[0024] As a further improvement of the present invention, the encoder described in step (3) is specifically implemented as follows:

[0025] (3-1) The Local Spatial Encoding Unit first calculates the neighborhood index of each point in the preprocessed point cloud P2 using the K-nearest neighbor algorithm. Then, based on the spatial location and neighborhood relationship of the point, it calculates the local features of the points in the neighborhood. Next, it extracts the local geometric information features F1 through convolution. In this way, the module can capture the fine-grained geometric shape in the point cloud, including the distance between points, normal vectors, and other local geometric features, thereby enhancing the network's sensitivity and representation ability to local features and effectively improving the accuracy of point cloud processing.

[0026] (3-2) The attention pooling unit filters and aggregates the output features F1 of the local spatial encoding unit through global average pooling to generate attention scores. These scores serve as weighting coefficients that dynamically adjust the weight $W_i$ of each point, allowing the model to focus more on the parts of the point cloud that contain important geometric information in subsequent stages. The attention scores are multiplied element-wise with the neighborhood features F1 to obtain the weighted aggregated features.

[0027] (3-3) Random downsampling is performed on the point cloud after local spatial feature aggregation. A random sampling strategy with linear complexity is used to select points from the point cloud proportionally. First, the sampling rate is set to retain 10%-20% of the points. Then, sampling points are selected from the point cloud in a uniform distribution manner to ensure that the computational complexity is reduced without losing key geometric features. A weighted strategy is adopted to prioritize the retention of points with important geometric information, such as edges or target regions, in order to further improve the quality of the sampled point cloud and obtain the sampled point cloud P3.

[0028] As a further improvement of the present invention, the specific implementation of the multi-scale attention aggregation module in step (4) is as follows:

[0029] (4-1) Receive the high-resolution, medium-resolution, and low-resolution feature tensors output by each layer of the encoder, and concatenate them to obtain the feature F2 of shape $[B, N, 1, C_{fused}]$. First, a 1×1 convolution uniformly reduces the features F2 to the shared latent space $D = [C_{out}/\alpha]$, where $\alpha$ is the channel compression factor, resulting in the feature F′2 of shape $[B, N, 1, C_{fused}/4]$. Then F′2 is input to the channel attention unit and the spatial attention unit respectively;

[0030] (4-2) Channel Attention Unit: Compresses the input feature F′2 along the spatial dimension H×W through average pooling and max pooling to generate the channel description vectors $F_{avg}$ and $F_{max}$. The pooling results then undergo two nonlinear transformations in a fully connected layer built from two shared convolutions, giving the features $F'_{avg}$ and $F'_{max}$. The outputs of the two branches are summed and passed through a sigmoid activation function to obtain the channel attention weights $W_C$. Finally, the channel attention weights $W_C$ scale the input feature F′2 channel by channel to obtain the output feature $F'_C$.

[0031] (4-3) Spatial Attention Unit: 3×3, 5×5, and 7×7 convolutional kernels are applied to the input feature F′2 to extract multi-scale local features. The three features are concatenated and then reduced in dimension by a 1×1 convolution to generate the feature group $F_{used}$. $F_{used}$ is then input to the spatial attention module, which performs adaptive reweighting along the spatial dimension. First, the mean and maximum values are calculated along the channel dimension to generate two spatial attention basis vectors $F_{used\_avg}$ and $F_{used\_max}$. Then, the spatial context relation $W_S$ is extracted through a 7×7 convolution. Finally, the spatial attention weight $W_S$ scales the input feature F′2 channel by channel to obtain the output feature $F'_S$, enhancing the response in key areas.
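The channel attention unit of step (4-2) can be sketched on plain lists as follows. This is an illustration under stated assumptions, not the patented network: an N×C list stands in for the $[B, N, 1, C]$ tensor, the two shared layers are bias-free matrices `w1` and `w2` with hypothetical values, and ReLU is assumed as the nonlinearity between them (the text only says "nonlinear transformations").

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def shared_mlp(vec, w1, w2):
    """Two shared layers; ReLU in between is an assumption."""
    hidden = [max(0.0, h) for h in matvec(w1, vec)]
    return matvec(w2, hidden)

def channel_attention(features, w1, w2):
    """features: N x C list. Returns (channel weights, rescaled N x C features)."""
    channels = list(zip(*features))                   # C tuples of length N
    f_avg = [sum(ch) / len(ch) for ch in channels]    # average pooling over points
    f_max = [max(ch) for ch in channels]              # max pooling over points
    summed = [a + m for a, m in zip(shared_mlp(f_avg, w1, w2),
                                    shared_mlp(f_max, w1, w2))]
    w_c = [sigmoid(s) for s in summed]                # channel attention weights
    scaled = [[x * w for x, w in zip(row, w_c)] for row in features]
    return w_c, scaled

# Hypothetical toy input: 2 points, 2 channels, hidden width 1, all-zero weights.
feats = [[2.0, 4.0], [6.0, 8.0]]
w1 = [[0.0, 0.0]]        # maps C=2 -> hidden=1
w2 = [[0.0], [0.0]]      # maps hidden=1 -> C=2
w_c, scaled = channel_attention(feats, w1, w2)
```

With all-zero weights the sigmoid input is 0, so every channel weight is 0.5 and each feature is halved. Trained weights would instead emphasize some channels over others.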

[0032] As a further improvement of the present invention, the specific implementation of the decoder in step (5) is as follows:

[0033] (5-1) The inputs are the feature matrix F and the interpolation index I, a precomputed index that stores the nearest neighbor of each target point. The interpolated feature matrix F' is calculated from them.

[0034] (5-2) The decoder optimizes the edge segmentation accuracy by fusing multi-scale features F' layer by layer, improves the ability to preserve the geometric details of the target fastener thread, and finally obtains the accurate segmentation result point cloud P4.
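Nearest neighbor interpolation upsampling, as described in step (5-1), can be sketched as follows. The helper names `nearest_index` and `upsample` and the toy data are hypothetical, and the brute-force search merely stands in for whatever precomputed index the actual network uses.

```python
import math

def nearest_index(dense_points, sparse_points):
    """Interpolation index I: for each dense point, the index of its nearest sparse point."""
    return [
        min(range(len(sparse_points)), key=lambda j: math.dist(p, sparse_points[j]))
        for p in dense_points
    ]

def upsample(sparse_features, index):
    """Interpolated features F': copy the feature row of the nearest sparse point."""
    return [sparse_features[j] for j in index]

# Hypothetical toy data: four dense points, two sampled points with 1-D features.
dense = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (1.0, 0.0, 0.0), (0.9, 0.0, 0.0)]
sparse = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
sparse_feats = [[1.0], [2.0]]

index = nearest_index(dense, sparse)
dense_feats = upsample(sparse_feats, index)
```

Because the index depends only on point coordinates, it can be computed once per downsampling level and reused by the decoder at the matching level.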

[0035] As a further improvement of the present invention, the point cloud subset extraction method in step (6) is as follows:

[0036] (6-1) Output the coordinates of point cloud P4 based on the semantic label value (0 / 1), separate the point clouds of the battery panel and the target fastener, and obtain point cloud P5;

[0037] (6-2) The extracted point cloud data is saved in PLY or PCD format, which can be directly called by subsequent registration algorithms.
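The label-based split of steps (6-1) and (6-2) can be sketched as follows. The assumption that label 1 marks the target fastener and label 0 the background is made here for illustration only (the text gives the label values but not their assignment), and `to_ascii_ply` is a hypothetical helper that writes only vertex coordinates in ASCII PLY.

```python
def split_by_label(points, labels, target_label=1):
    """Split a labelled point cloud into (target, background) subsets."""
    target = [p for p, lab in zip(points, labels) if lab == target_label]
    background = [p for p, lab in zip(points, labels) if lab != target_label]
    return target, background

def to_ascii_ply(points):
    """Serialize xyz points as a minimal ASCII PLY document."""
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ]
    body = [f"{x} {y} {z}" for x, y, z in points]
    return "\n".join(header + body) + "\n"

# Hypothetical toy data: three points, the middle one labelled as fastener.
pts = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
labs = [0, 1, 0]
fastener, panel = split_by_label(pts, labs)
ply_text = to_ascii_ply(fastener)
```

Writing `ply_text` to a `.ply` file gives a cloud that common point cloud tools can load for a later registration step.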

[0038] Beneficial effects:

[0039] This invention discloses a method for target point cloud segmentation in battery swapping robots based on multi-scale attention aggregation. This method acquires RGB images, depth images, and point cloud data of the target fastener using a binocular structured light depth camera, constructing a FastSeg3D dataset covering various complex scenes, providing high-quality data support for model training. In the data preprocessing stage, radius filtering and DBSCAN density clustering algorithms are used to effectively remove outliers and distant background noise, significantly improving the signal-to-noise ratio of the target point cloud. In the point cloud segmentation stage, based on an improved MSA-RandLA-Net network, combined with a multi-scale attention aggregation module, random sampling and local feature aggregation significantly reduce computational complexity while preserving the geometric details of the point cloud. The multi-scale attention aggregation module enhances the model's ability to handle complex target edges through channel attention weighting and spatial attention weighting, improving segmentation accuracy.

Attached Figure Description

[0040] Figure 1 is a flowchart of the method disclosed in this invention.

Detailed Implementation

[0041] The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments:

[0042] This invention provides a method for segmenting target point clouds of battery swapping robots based on multi-scale attention aggregation, which can shorten the overall replacement time of battery swapping robots and overcome the limitation that large-scale target point clouds cannot be segmented directly, quickly, and accurately.

[0043] As one embodiment of the present invention, the present invention provides a method for target point cloud segmentation of a battery swapping robot based on multi-scale attention aggregation, wherein the target fastener segmentation flowchart is shown in Figure 1. This can be achieved through the following steps:

[0044] (1) Use a binocular structured light depth camera to capture images, obtain RGB images, depth images and three-dimensional point cloud data of the target fastener, and construct a target fastener point cloud dataset FastSeg3D containing multi-scene annotations, denoted as point cloud P0.

[0045] (2) Preprocess the original point cloud P0 data, specifically including: using the radius filtering algorithm to filter the point cloud P0, removing outliers and local noise, to obtain the preprocessed point cloud P1; using the density-based DBSCAN clustering algorithm to cluster the point cloud P1, removing irrelevant point clusters in the background of the distant battery panel, to obtain the point cloud P2.

[0046] (3) Obtain the neighborhood feature representation F1 of each point in the point cloud through the local spatial coding unit, and use the attention pooling mechanism to filter and aggregate the features to obtain the attention scores. The attention scores are multiplied element-wise with the neighborhood features F1 to obtain the weighted aggregated features. The features obtained by the local feature aggregation module are then randomly sampled to obtain point cloud P3, reducing the computational cost. The encoder is implemented by stacking multiple sets of local spatial coding and attention pooling units.

[0047] (4) The multi-scale attention aggregation module efficiently captures long-range dependencies through channel and spatial attention mechanisms. This module receives feature tensors from high-resolution, medium-resolution, and low-resolution encoders, concatenates them to obtain feature F2, and inputs them into the channel attention unit and spatial attention unit, respectively. The channel attention unit generates channel description vectors through global average pooling, and dynamically adjusts the channel weights by combining fully connected layers and activation functions to obtain the channel weight vector $W_C$. The spatial attention unit uses parallel 3×3, 5×5, and 7×7 convolutional kernels to extract local features from different receptive fields, resulting in a spatial weight vector $W_S$. The multi-scale attention aggregation module is embedded into the skip connection between the encoder and the decoder. The output features of each layer of the encoder are concatenated and then input into the multi-scale attention aggregation module for processing, and then concatenated layer by layer with the upsampled features of the decoder.

[0048] (5) The decoder takes the feature map after downsampling and feature extraction as input, and uses nearest neighbor interpolation upsampling to restore the features to the resolution of the original point cloud layer by layer to obtain point cloud P4;

[0049] (6) The semantic labels output by the segmentation network are indexed according to the coordinate dimension and feature dimension of the point cloud P4, and the point cloud regions corresponding to different label values are selected. Then, based on the selection results, the original point cloud is split into two independent point cloud subsets, representing the battery panel and the target fastener, respectively.

[0050] The method for setting up the binocular structured light depth camera in step (1) is as follows:

[0051] (1-1) The camera is mounted on the end of the battery swapping robot and moves along the track to a position 40-60cm below the target battery pack to collect data;

[0052] (1-2) Set the camera depth range to 50-3000mm, the depth map resolution to 960×600, and the color map resolution to 1920×1080. Ensure the coordinate system transformation accuracy through calibration tests.

[0053] (1-3) Data collection covers locked and unlocked states, different lighting conditions, shooting distance, tilt angle and clutter interference scenarios to ensure the generalizability of the dataset.

[0054] The point cloud data preprocessing method used in step (2) includes the following steps:

[0055] (2-1) For any point $p_i$ in the point cloud, calculate the number $N_i$ of its neighboring points within a radius $r$. If $N_i < k_{min}$, the point is identified as a noise point and removed from the point cloud. The mathematical expression is as follows:

[0056] $$N_i = \sum_{j \neq i} I\left(\lVert p_i - p_j \rVert \le r\right)$$

[0057] Where $I(\cdot)$ is an indicator function, taking a value of 1 when the condition is true and 0 otherwise. By adjusting the radius $r$ and the minimum neighbor number $k_{min}$, the filter can adapt to the characteristics of point cloud data in different scenarios and effectively remove isolated points and low-density points in the point cloud.

[0058] (2-2) For any point $p_i$ in the FastSeg3D dataset, define its neighborhood $N_\varepsilon(p_i)$ as the set of all points within a distance $\varepsilon$ (the neighborhood radius) of $p_i$:

[0059] $$N_\varepsilon(p_i) = \{\, q \in D \mid \mathrm{dist}(p_i, q) \le \varepsilon \,\}$$

[0060] Where $D$ is the point cloud dataset and $\mathrm{dist}(p, q)$ is the Euclidean distance between points $p$ and $q$.

[0061] (2-3) If the neighborhood $N_\varepsilon(p_i)$ of point $p_i$ contains fewer than MinPts points but $p_i$ lies within the neighborhood of a core object $p_j$, then $p_i$ is called a boundary point, and $p_i$ is said to be directly density-reachable from $p_j$. Otherwise, point $p_i$ is a noise point.

[0062] (2-4) Point $q$ is said to be density-reachable from point $p$ if and only if there exists a chain of points $p_1, \dots, p_n$ connected through core points such that $p_1 = p$, $p_n = q$, and $\mathrm{dist}(p_i, p_{i+1}) \le \varepsilon$ for $1 \le i < n$. If there exists an object $o \in D$ such that objects $q$ and $p$ are both density-reachable from $o$ with respect to $\varepsilon$ and MinPts, then objects $q$ and $p$ are said to be density-connected.

[0063] (2-5) Clusters are searched by examining the neighborhood of each point in the point cloud. If the neighborhood of point p contains more than MinPts points, a new cluster is created with p as the core object. DBSCAN then iteratively aggregates density-reachable objects from these core objects, involving the merging of density-reachable clusters. This process is repeated until all core points have been visited and no new points can be added to any cluster, at which point the process ends. Points that do not belong to any cluster are considered noise.
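The clustering procedure of steps (2-2) to (2-5) can be sketched in plain Python as follows. This is a textbook-style DBSCAN written for illustration, not the patent's code: the names `region_query` and `dbscan`, the toy coordinates, and the convention that a neighborhood includes the point itself and that a core point has at least `min_pts` points in it are assumptions made here. How the target cluster is then chosen among the clusters is not covered.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks noise points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE            # may be relabelled later as a boundary point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # boundary point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # j is also core: keep expanding
    return labels

# Hypothetical toy cloud: two dense groups and one isolated point.
cloud = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.0),
    (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0),
    (20.0, 20.0, 20.0),
]
labels = dbscan(cloud, eps=0.3, min_pts=3)
```

In this toy run the first four points form cluster 0, the next three form cluster 1, and the isolated point is labelled as noise.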

[0064] The encoder in step (3) includes the following steps:

[0065] (3-1) Finding neighboring points: In order to improve efficiency, for the i-th point, the index of its neighborhood points is first collected by the K-Nearest Neighbors (KNN) algorithm based on the Euclidean distance between points.

[0066] (3-2) Relative point position encoding: For each of the K nearest points $\{p_i^1, \ldots, p_i^k, \ldots, p_i^K\}$ of the center point $p_i$, we explicitly encode its relative position as follows:

[0067]

[0068] Where $p_i$ and $p_i^k$ are the x, y, and z coordinates of the center point and of its k-th neighboring point; the terms are joined by a concatenation operation; $\|\cdot\|$ calculates the Euclidean distance between the neighboring point and the center point; and $r_i^k$ is the resulting encoded relative position of the neighboring point.

[0069] (3-3) Point feature enhancement: For each neighboring point $p_i^k$, the encoded relative point position $r_i^k$ is concatenated with its corresponding point feature $f_i^k$ to obtain the enhanced feature vector.

[0070] (3-3) Finally, the local spatial coding unit outputs a new set of neighboring features, which explicitly encodes the local geometry of the center point $p_i$.
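As a non-limiting illustration, the neighbor search, relative position encoding, and feature enhancement described above may be sketched as follows. Because the encoding formula itself is not reproduced in the text, the sketch assumes the form used in RandLA-Net-style local spatial encoding, namely the concatenation of the center coordinates, the neighbor coordinates, their difference, and their Euclidean distance; the learnable layer normally applied to this vector is omitted, and all function names are hypothetical.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of every point, shape (N, k)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # For distinct points, column 0 of the argsort is the point itself
    # (distance 0), so it is skipped.
    return np.argsort(dist, axis=1)[:, 1:k + 1]

def relative_position_encoding(points, idx):
    """Assumed encoding [p_i, p_i^k, p_i - p_i^k, ||p_i - p_i^k||], shape (N, k, 10)."""
    center = np.repeat(points[:, None, :], idx.shape[1], axis=1)  # (N, k, 3)
    neigh = points[idx]                                           # (N, k, 3)
    offset = center - neigh
    dist = np.linalg.norm(offset, axis=-1, keepdims=True)         # (N, k, 1)
    return np.concatenate([center, neigh, offset, dist], axis=-1)

def enhance_features(encoding, features, idx):
    """Concatenate each encoding r_i^k with the neighbor's point feature f_i^k."""
    return np.concatenate([encoding, features[idx]], axis=-1)

pts = np.array([[0.0, 0, 0], [1.0, 0, 0], [3.0, 0, 0], [7.0, 0, 0]])
feats = np.ones((4, 5))                      # dummy per-point features
idx = knn_indices(pts, k=2)
enc = relative_position_encoding(pts, idx)
enhanced = enhance_features(enc, feats, idx)
```

Here `enc[0, 0]` describes the first point and its nearest neighbor, and `enhanced` has 10 + 5 = 15 channels per neighbor.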

[0071] (3-4) Calculate the attention score: Given a set of local features, a unique attention score is learned for each feature through a shared function g(). Essentially, the function g() consists of a shared multilayer perceptron (MLP) followed by a softmax function. Its formal definition is as follows:

[0072]

[0073] Where W is the learnable weight of the shared multilayer perceptron.
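As a non-limiting illustration, the attention scoring of step (3-4) may be sketched as follows. A single linear layer with weights `W` stands in for the shared MLP, the softmax is taken over the K neighbors, and the weighted features are summed over the neighborhood; these choices and the name `attentive_pooling` are assumptions of the sketch.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pooling(feats, W):
    """feats: (N, K, d) neighborhood features; W: (d, d) shared weights.

    Returns the attention scores (N, K, d) and the pooled features (N, d).
    """
    scores = softmax(feats @ W, axis=1)       # one score per neighbor and channel
    pooled = (scores * feats).sum(axis=1)     # element-wise weighting, summed over K
    return scores, pooled

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4, 8))            # 6 points, 4 neighbors, 8 channels
W = rng.normal(size=(8, 8))
scores, pooled = attentive_pooling(feats, W)
```

With all-zero weights the scores become uniform and the pooling reduces to a plain average over the neighbors, which is a convenient sanity check.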

[0074] (3-5) Random downsampling: From the input point cloud $P = \{P_1, P_2, \ldots, P_N\}$, K points are randomly selected from a uniform distribution to form a subset $S = \{P_{\sigma(1)}, P_{\sigma(2)}, \ldots, P_{\sigma(K)}\}$, where $\sigma(i)$ is a random index generated independently from the interval $[1, N]$, satisfying $\sigma(i) \sim \mathrm{Uniform}(1, N)$.
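As a non-limiting illustration, the random downsampling of step (3-5) may be sketched as follows. The function name `random_downsample` is hypothetical, and the indices are 0-based rather than 1-based. Because each index is drawn independently, the same point can be selected more than once.

```python
import numpy as np

def random_downsample(points, k, rng):
    """Draw k indices independently and uniformly; duplicates are possible."""
    idx = rng.integers(0, len(points), size=k)   # 0-based counterpart of sigma(i)
    return points[idx], idx

rng = np.random.default_rng(0)
cloud = np.arange(300, dtype=float).reshape(100, 3)
subset, idx = random_downsample(cloud, k=20, rng=rng)
```

The cost of drawing the indices depends only on K and not on any geometric property of the cloud, which is what keeps this sampling step cheap for large inputs.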

[0075] The multi-scale attention aggregation module in step (4) includes the following steps:

[0076] (4-1) Receive the feature tensors output from each layer of the encoder (high-resolution, medium-resolution, and low-resolution) and concatenate them to obtain the feature $F_2 = [B, N, 1, C_{fused}]$. First, the features $F_2$ are uniformly reduced to the shared latent space $D = [C_{out}/\alpha]$ using 1×1 convolution, where $\alpha$ is the channel compression factor, resulting in feature $F'_2 = [B, N, 1, C_{fused}/4]$. Then, $F'_2$ is input to the channel attention unit and the spatial attention unit, respectively.

[0077] (4-2) Channel Attention Unit: Compresses the input feature $F'$ along the spatial dimension H×W to generate the channel description vectors $F_{avg}$ and $F_{max}$:

[0078]

[0079]

[0080] The pooled features $F_{avg}$ and $F_{max}$ each undergo two non-linear transformations through fully connected layers (two shared convolutional layers):

[0081] $F'_{avg} = W_2 \cdot \mathrm{ReLU}(W_1 \cdot F_{avg})$

[0082] $F'_{max} = W_2 \cdot \mathrm{ReLU}(W_1 \cdot F_{max})$

[0083] Where $W_1 \in \mathbb{R}^{C/r \times C}$ and $W_2 \in \mathbb{R}^{C \times C/r}$, and $r$ is the channel compression factor.

[0084] The outputs of the two branches are summed and passed through a Sigmoid activation function to obtain the channel attention weights $W_C \in \mathbb{R}^{B \times C \times 1 \times 1}$:

[0085] $W_C = \sigma(F'_{avg} + F'_{max})$

[0086] Where σ(·) is the Sigmoid function.

[0087] The final output feature $F'_C$ is obtained by channel-wise scaling of the input feature $F'$ with the channel attention weights:

[0088] $F'_C = F' \odot W_C$

[0089] Where $\odot$ represents channel-by-channel multiplication: the feature map of each channel is multiplied by the corresponding scalar weight in $W_C$. This operation highlights the responses of important channels and suppresses redundant information.
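As a non-limiting illustration, the channel attention unit of step (4-2) may be sketched as follows. The sketch assumes a channels-first layout (B, C, H, W), matching the stated shape of the channel attention weights, and takes the two pooling operations to be average and max pooling over H×W; the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """F: (B, C, H, W); W1: (C//r, C); W2: (C, C//r). Returns (F'_C, W_C)."""
    f_avg = F.mean(axis=(2, 3))                    # (B, C) average-pooled descriptor
    f_max = F.max(axis=(2, 3))                     # (B, C) max-pooled descriptor

    def shared_mlp(v):
        # W2 . ReLU(W1 . v), applied to every row of v with the same weights
        return np.maximum(v @ W1.T, 0.0) @ W2.T

    w_c = sigmoid(shared_mlp(f_avg) + shared_mlp(f_max))   # (B, C)
    w_c = w_c[:, :, None, None]                    # (B, C, 1, 1) for broadcasting
    return F * w_c, w_c

rng = np.random.default_rng(0)
C, r = 8, 4
F = rng.normal(size=(2, C, 5, 1))
W1 = rng.normal(size=(C // r, C))
W2 = rng.normal(size=(C, C // r))
F_c, W_c = channel_attention(F, W1, W2)
```

Each entry of `W_c` lies strictly between 0 and 1, so the unit can only attenuate a channel relative to the others and never flips its sign.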

[0090] (4-3) Spatial Attention Unit: Apply 3×3, 5×5, and 7×7 convolutional kernels to the input feature F' to extract multi-scale local features and generate feature sets:

[0091] $F_{3\times3} = \mathrm{Conv}_{3\times3}(F'),\quad F_{5\times5} = \mathrm{Conv}_{5\times5}(F'),\quad F_{7\times7} = \mathrm{Conv}_{7\times7}(F')$

[0092] These features are added element-wise:

[0093] $F_{used} = F_{3\times3} + F_{5\times5} + F_{7\times7}$

[0094] $F_{used}$ is input to the spatial attention module, which performs adaptive reweighting along the spatial dimension. First, it calculates the mean and maximum values along the channel dimension to generate two spatial attention basis vectors:

[0095]

[0096] The output tensors $F_{used\_avg}, F_{used\_max} \in \mathbb{R}^{B \times 1 \times H \times W}$ represent the global and local salient regions, respectively. The results of average pooling and max pooling are concatenated along the channel dimension, giving $\mathrm{Concat}(F_{used\_avg}, F_{used\_max}) \in \mathbb{R}^{B \times 2 \times H \times W}$. Then, spatial context relations are extracted using a 7×7 convolution:

[0097]

[0098] Where $W_{i,j}$ are the parameters of the 7×7 convolution kernel, $\sigma(\cdot)$ is the sigmoid function, and the final output is the spatial attention weight $W_S \in \mathbb{R}^{B \times 1 \times H \times W}$.

[0099] The final output feature $F'_S$ is obtained by scaling the input feature $F_{used}$ channel by channel with the spatial attention weight $W_S$, which enhances the response in critical areas:

[0100] $F'_S = F_{used} \odot W_S$

[0101] This operation focuses the network on areas with significant geometric structures such as the edges and threads of the target fastener, thus suppressing background noise.
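As a non-limiting illustration, the spatial attention unit of step (4-3) may be sketched as follows on a generic (B, C, H, W) grid. The zero-padded "same" convolution helper, the bias-free kernels, and all function names are assumptions of this sketch; the gating step assumes that the 7×7 convolution of the concatenated mean and max maps is followed by a sigmoid, as the surrounding description of the spatial attention weight indicates.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    """Zero-padded 'same' cross-correlation (the usual deep-learning convolution).

    x: (B, Cin, H, W); w: (Cout, Cin, k, k) with odd k. Returns (B, Cout, H, W).
    """
    p = w.shape[-1] // 2
    xp = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)))
    win = sliding_window_view(xp, w.shape[-2:], axis=(2, 3))  # (B, Cin, H, W, k, k)
    return np.einsum("bchwij,ocij->bohw", win, w)

def spatial_attention(F, kernels, w_gate):
    """kernels: list of (C, C, k, k) arrays for k = 3, 5, 7; w_gate: (1, 2, 7, 7)."""
    fused = sum(conv2d_same(F, w) for w in kernels)      # F_3x3 + F_5x5 + F_7x7
    f_avg = fused.mean(axis=1, keepdims=True)            # (B, 1, H, W)
    f_max = fused.max(axis=1, keepdims=True)             # (B, 1, H, W)
    stacked = np.concatenate([f_avg, f_max], axis=1)     # (B, 2, H, W)
    w_s = sigmoid(conv2d_same(stacked, w_gate))          # (B, 1, H, W)
    return fused * w_s, w_s, fused                       # broadcast over channels

rng = np.random.default_rng(0)
C = 4
F = rng.normal(size=(1, C, 6, 6))
kernels = [rng.normal(scale=0.1, size=(C, C, k, k)) for k in (3, 5, 7)]
w_gate = rng.normal(scale=0.1, size=(1, 2, 7, 7))
F_s, W_s, F_used = spatial_attention(F, kernels, w_gate)
```

The same single-channel map `W_s` multiplies every channel of `F_used`, so the reweighting acts on positions rather than on channels.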

[0102] (4-4) Fusion of the channel attention unit and the spatial attention unit: Finally, the spatially weighted feature $F'_S$ and the channel-attention-weighted feature $F'_C$ are added together and then mapped to the target number of channels via a 1×1 convolution, thus completing the adaptive fusion of cross-scale features.

[0103] The decoder in step (5) includes the following steps:

[0104] (5-1) Given the input feature matrix $F \in \mathbb{R}^{B \times N \times d}$, where B is the batch size, N is the number of points, and d is the feature dimension, and the interpolation index $I \in \mathbb{R}^{B \times M}$, which stores the precomputed nearest neighbor index of each target point and where M is the number of points after upsampling, the interpolated feature matrix $F' \in \mathbb{R}^{B \times M \times d}$ is calculated in the following way:

[0105] F'[b,m,:]=F[b,I[b,m],:]

[0106] Specifically, for each batch b and target point m, the original features corresponding to index I[b,m] are directly copied to the interpolated feature matrix. This operation expands the feature tensor from [B,N,d] to [B,M,d], significantly increasing the point cloud density while preserving local feature consistency. Combined with a subsequent multi-scale attention aggregation module, the interpolated features are semantically enhanced and spatially refined using transposed convolution, ultimately achieving feature reconstruction from sparse high-level semantics to dense low-level geometry.
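As a non-limiting illustration, the nearest neighbor interpolation of step (5-1) may be sketched as follows. The helper `nearest_index`, which builds the interpolation index by brute force from source and target coordinates, is an assumption of the sketch, since the text only states that the index is prepared in advance.

```python
import numpy as np

def nearest_index(src_xyz, dst_xyz):
    """For each target point, the index of the closest source point.

    src_xyz: (B, N, 3); dst_xyz: (B, M, 3). Returns an integer array of shape (B, M).
    """
    d = np.linalg.norm(dst_xyz[:, :, None, :] - src_xyz[:, None, :, :], axis=-1)
    return d.argmin(axis=-1)                    # (B, M)

def nn_upsample(F, I):
    """Implements F'[b, m, :] = F[b, I[b, m], :] with fancy indexing."""
    b = np.arange(F.shape[0])[:, None]          # (B, 1), broadcasts against I
    return F[b, I]                              # (B, M, d)

src = np.array([[[0.0, 0, 0], [10.0, 0, 0]]])                            # B=1, N=2
dst = np.array([[[1.0, 0, 0], [9.0, 0, 0], [0.0, 1, 0], [10.0, 1, 0]]])  # M=4
F = np.array([[[1.0, 2.0], [3.0, 4.0]]])                                 # d=2
I = nearest_index(src, dst)
F_up = nn_upsample(F, I)
```

Each upsampled point simply copies the feature vector of its nearest source point, so no new feature values are created by this step.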

[0107] The point cloud extraction in step (6) includes the following steps:

[0108] (6-1) Through layer-by-layer feature fusion of the decoder, the final output is the segmentation result point cloud $P_{seg} \in \mathbb{R}^{N \times 2}$. The semantic label of each point indicates whether it belongs to the solar panel or the target fastener:

[0109] $P_{seg} = \mathrm{Softmax}(\mathrm{FC}(F_{output}))$

[0110] In this context, FC stands for fully connected layer, and Softmax is the normalization operation.
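As a non-limiting illustration, the classification head and the subsequent splitting of the point cloud by label may be sketched as follows. The weights `W` and `b` of the fully connected layer, the function names, and the convention that label 1 denotes the target fastener are assumptions of this sketch.

```python
import numpy as np

def segment(features, W, b):
    """Softmax(FC(F_output)): features (N, d), W (d, 2), b (2,) -> probabilities (N, 2)."""
    logits = features @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def split_by_label(points, probs):
    """Assign each point its most probable class and split the cloud in two."""
    labels = probs.argmax(axis=1)
    # Assumption: label 0 is the background class and label 1 the target fastener.
    return points[labels == 0], points[labels == 1], labels

features = np.array([[2.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
points = np.arange(9, dtype=float).reshape(3, 3)
probs = segment(features, W=np.eye(2), b=np.zeros(2))
background, fastener, labels = split_by_label(points, probs)
```

With identity weights the logits equal the input features, so the first and third points are assigned to class 0 and the second to class 1.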

[0111] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any other way. Any modifications or equivalent changes made based on the technical essence of the present invention shall still fall within the scope of protection claimed by the present invention.

Claims

1. A target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation, characterized in that it includes the following steps:
(1) Use a binocular structured light depth camera to capture images, obtain RGB images, depth images and three-dimensional point cloud data of the target fastener, and construct a target fastener point cloud dataset FastSeg3D containing multi-scene annotations, denoted as point cloud P0;
(2) Preprocess the original point cloud P0, specifically including: using the radius filtering algorithm to filter the point cloud P0, removing outliers and local noise, to obtain the preprocessed point cloud P1; using the density-based DBSCAN clustering algorithm to cluster the point cloud P1, removing irrelevant point clusters in the background of the distant solar panel, to obtain the point cloud P2;
(3) Obtain the neighborhood feature representation F1 of each point in the point cloud through the local spatial coding unit, and use the attention pooling mechanism to filter and aggregate the features to obtain the attention scores; the attention scores are multiplied element-wise with the neighborhood features F1 to obtain the weighted aggregated features; random sampling is performed on the features obtained by the local feature aggregation module to obtain point cloud P3, reducing the computational load; the encoder is implemented by stacking multiple sets of local spatial coding and attention pooling units;
(4) The multi-scale attention aggregation module efficiently captures long-range dependencies through channel and spatial attention mechanisms; this module receives feature tensors from high-resolution, medium-resolution, and low-resolution encoders and concatenates them to obtain feature F2, which is then input into the channel attention unit and the spatial attention unit, respectively; the channel attention unit generates channel description vectors through global average pooling and dynamically adjusts the channel weights by combining fully connected layers and activation functions to obtain the channel weight vector $W_C$; the spatial attention unit uses parallel 3×3, 5×5, and 7×7 convolutional kernels to extract local features from different receptive fields, resulting in a spatial weight vector $W_S$; the multi-scale attention aggregation module is embedded into the skip connection between the encoder and the decoder; the output features of each layer of the encoder are concatenated and then input into the multi-scale attention aggregation module for processing, and then concatenated layer by layer with the upsampled features of the decoder;
(5) The decoder takes the feature map after downsampling and feature extraction as input, and uses nearest neighbor interpolation upsampling to restore the features to the resolution of the original point cloud layer by layer to obtain point cloud P4;
(6) The semantic labels output by the segmentation network are indexed according to the coordinate dimension and feature dimension of the point cloud P4, and the point cloud regions corresponding to different label values are selected; then, based on the selection results, the original point cloud is split into two independent point cloud subsets, representing the solar panel and the target fastener, respectively.

2. The target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation according to claim 1, characterized in that the method for setting up the binocular structured light depth camera in step (1) is as follows:
(1-1) Mount the camera on the end of the battery swapping robot and move it along the track to a position 40-60 cm below the target battery pack to collect data;
(1-2) Set the camera depth range to 50-3000 mm, the depth map resolution to 960×600, and the color map resolution to 1920×1080, and ensure the coordinate system transformation accuracy through calibration tests;
(1-3) Data collection covers locked and unlocked states, different lighting conditions, shooting distances, tilt angles and clutter interference scenarios to ensure the generalizability of the dataset.

3. The target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation according to claim 1, characterized in that the preprocessing described in step (2) includes:
(2-1) Radius filtering: Set the search radius r = 1.0 mm and the minimum number of neighbor points n = 10, remove noise points with neighborhood density below the threshold, and obtain point cloud P1;
(2-2) DBSCAN density clustering: Based on the k-nearest neighbor distance distribution curve, the optimal neighborhood radius ε and the minimum number of points minPts are determined, low-density background point clusters are identified and removed, the signal-to-noise ratio of the target point cloud is improved, and the point cloud P2 is obtained.

4. The target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation according to claim 1, characterized in that the encoder described in step (3) is specifically implemented as follows:
(3-1) The local spatial coding unit first calculates the neighborhood index of each point in the preprocessed point cloud P2 using the K-nearest neighbor algorithm; then, based on the spatial location and neighborhood relationship of the point, it calculates the local features of the points in the neighborhood; next, it extracts the local geometric information features F1 through a convolution operation; in this way, the module can capture the fine-grained geometric shape in the point cloud, including the distance between points, normal vectors and other local geometric features, thereby enhancing the network's sensitivity and representation ability for local features and effectively improving the accuracy of point cloud processing;
(3-2) The attention pooling unit filters and aggregates the output features F1 of the local spatial encoding unit through global average pooling to generate attention scores; these weighting coefficients are used to dynamically adjust the weight $W_i$ of each point, which allows the model to focus more on the parts of the point cloud containing important geometric information in subsequent stages; the attention scores are multiplied element-wise with the neighborhood features F1 to obtain the weighted aggregated features;
(3-3) Random downsampling is performed on the point cloud after local spatial feature aggregation; a random sampling strategy with linear complexity is used to select points from the point cloud proportionally; first, the sampling rate is set to retain 10%-20% of the points; then, sampling points are selected from the point cloud in a uniform distribution manner to ensure that the computational complexity is reduced without losing key geometric features; a weighted strategy is adopted to prioritize the retention of points with important geometric information, such as edges or target regions, in order to further improve the quality of the sampled point cloud and obtain the sampled point cloud P3.

5. The target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation according to claim 1, characterized in that the specific implementation of the multi-scale attention aggregation module in step (4) is as follows:
(4-1) Receive the feature tensors output from each layer of the encoder (high-resolution, medium-resolution, and low-resolution) and concatenate them to obtain the feature $F_2 = [B, N, 1, C_{fused}]$; first, the features $F_2$ are uniformly reduced to the shared latent space $D = [C_{out}/\alpha]$ using 1×1 convolution, where $\alpha$ is the channel compression factor, resulting in feature $F'_2 = [B, N, 1, C_{fused}/4]$, and then $F'_2$ is input to the channel attention unit and the spatial attention unit respectively;
(4-2) Channel Attention Unit: The input feature $F'_2$ is compressed along the spatial dimension H×W by pooling to generate the channel description vectors $F_{avg}$ and $F_{max}$; two nonlinear transformations are then applied to the pooling results through a fully connected layer containing two shared convolutions to obtain the features $F'_{avg}$ and $F'_{max}$; the outputs of the two branches are summed and passed through a sigmoid activation function to obtain the channel attention weights $W_C$; finally, the channel attention weights $W_C$ are used to scale the input feature $F_2$ channel by channel to obtain the output feature $F'_C$;
(4-3) Spatial Attention Unit: 3×3, 5×5, and 7×7 convolutional kernels are applied to the input feature $F'_2$ to extract multi-scale local features; the three features are concatenated and then dimensionality-reduced by 1×1 convolution to generate the feature group $F_{used}$; $F_{used}$ is input to the spatial attention module, which performs adaptive reweighting along the spatial dimension: first, the mean and maximum values along the channel dimension are calculated to generate two spatial attention basis vectors $F_{used\_avg}$ and $F_{used\_max}$; then, the spatial context relation $W_S$ is extracted through 7×7 convolution; finally, the spatial attention weight $W_S$ is used to scale the input feature $F'_2$ channel by channel to obtain the output feature $F'_S$, enhancing the response in key areas.

6. The target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation according to claim 1, characterized in that the specific implementation of the decoder in step (5) is as follows:
(5-1) Given the input feature matrix F and the interpolation index I, which stores the precomputed nearest neighbor index of each target point, calculate the interpolated feature matrix F';
(5-2) The decoder optimizes the edge segmentation accuracy by fusing multi-scale features F' layer by layer, improves the ability to preserve the geometric details of the target fastener, and finally obtains the accurate segmentation result point cloud P4.

7. The target point cloud segmentation method for battery swapping robots based on multi-scale attention aggregation according to claim 1, characterized in that the point cloud subset extraction method described in step (6) is as follows:
(6-1) Output the coordinates of point cloud P4 based on the semantic label value (0 / 1), separate the point clouds of the solar panel and the target fastener, and obtain point cloud P5;
(6-2) The extracted point cloud data is saved in PLY or PCD format, which can be directly called by subsequent registration algorithms.