A robot operation target 6D pose tracking method based on time sequence point cloud fusion

By acquiring point cloud video sequences using an RGB-D camera and performing feature fusion, the problem of information interaction difficulties caused by the irregularity of 3D point cloud data was solved, enabling accurate and stable 6D pose tracking of robot targets and improving the performance of robot operations.

CN117765032BActive Publication Date: 2026-06-23ZHEJIANG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG UNIV
Filing Date
2023-11-24
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

The lack of explicit alignment relationships in existing 3D point cloud data during robot operations makes it difficult to exchange information between point clouds during temporal observation, affecting the accuracy and stability of pose tracking.

Method used

By acquiring point cloud video sequences using an RGB-D camera, extracting features through point cloud segmentation and cross-attention mechanisms, constructing dense point cloud pairs, and performing point-by-point feature fusion, the robot predicts relative transformation poses and achieves 6D pose tracking of the robot's target.

Benefits of technology

It improves the accuracy and stability of target pose tracking for robot operations, making it suitable for continuous operation tasks in dynamic scenarios and enhancing the flexibility and intelligence of the robot system.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN117765032B_ABST
    Figure CN117765032B_ABST
Patent Text Reader

Abstract

The application discloses a robot operation target 6D pose tracking method based on time sequence point cloud fusion. First, a point cloud video sequence of a robot operation scene is acquired and an operation target is segmented from the point cloud video sequence, then, dense corresponding relations among time sequence observation point clouds are established in a reverse prediction mode, and the dense corresponding relations are taken as a fusion criterion to perform point-by-point feature fusion on the time sequence observation point clouds, on the basis of the point-by-point feature fusion, a transformation pose between adjacent frames is predicted from the time sequence fused features in a confidence degree regression mode, finally, the transformation pose between the adjacent frames is continuously regressed to realize continuous pose tracking of the operation target. The application tracks a 6D pose of a robot operation target from a three-dimensional point cloud video sequence based on a deep learning technology, in view of irregularity and disorder of three-dimensional point cloud data, time sequence correlation information among observation point cloud data is fully utilized and effective feature fusion is performed, accurate and stable tracking is realized, and the application has good engineering practical value.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention relates to a 6D pose tracking method for robot operation targets in the field of 3D vision and robotics, and particularly to a 6D pose tracking method for robot operation targets based on temporal point cloud fusion. Background Technology

[0002] As an important intelligent production unit in modern industrial systems, robots typically need to dynamically perceive the surrounding work environment, especially to continuously track the 6D pose of the work target in the environment in order to fully understand the spatial position of the work target and make corresponding action decisions, thereby meeting the robot's dynamic response requirements for continuous work tasks.

[0003] Thanks to the rapid development of various depth sensors (including RGB-D cameras, LiDAR, etc.) and deep learning technologies, 3D vision methods have shown valuable application prospects in the field of target pose recognition. 3D vision provides richer spatial geometric information compared to 2D vision, which is beneficial for achieving more accurate 6D pose recognition. However, these methods mainly focus on pose estimation in single-frame observation data. Since they do not consider the consistent continuity of time and space, the pose tracking results obtained from continuous multi-frame observations may be unstable, which can seriously affect the performance of robot tasks.

[0004] Making full use of the temporal information in the observation sequence to achieve pose tracking is the key to solving this problem. However, 3D point clouds are different from 2D images with regular pixel arrangements. Due to the irregularity and disorder of point cloud data, there is a lack of explicit alignment relationships between temporally observed point clouds, making it difficult to carry out effective information interaction and feature fusion, thereby reducing the accuracy of pose tracking results. Summary of the Invention

[0005] To address the problems existing in the background technology, this invention proposes a 6D pose tracking method for robot operation targets based on temporal point cloud fusion. This method utilizes deep learning technology to achieve 6D pose tracking of robot operation targets in point cloud video sequences, effectively improving the accuracy and stability of pose tracking.

[0006] The specific technical solution created by this invention is as follows:

[0007] S1: Continuously acquire and obtain the original point cloud video sequence of the robot's operation scene through an RGB-D camera;

[0008] S2: After performing point cloud segmentation on the original point cloud video sequence, an observation point cloud video sequence containing only the robot's work target is obtained, and the initial frame temporal observation point cloud is determined.

[0009] S3: Extract the features corresponding to each point cloud in the temporal observation point cloud of the current frame and its previous frame, and use the reverse prediction method to construct a dense point cloud pair of the temporal observation point clouds of the current two adjacent frames.

[0010] S4: Based on dense point cloud pairs, the features corresponding to each point cloud in the temporal observation point cloud of the current frame and its previous frame are fused point by point to obtain fused point cloud features, and then the relative transformation pose between the two adjacent frames is predicted.

[0011] S5: Repeat S3-S4 to calculate the relative transformation pose between the remaining two adjacent temporal observation point clouds in the video sequence containing only the robot's work target, thereby achieving 6D pose tracking of the work target.

[0012] In step S2, for each frame of the original point cloud in the original point cloud video sequence, the semantic category of all points in the original point cloud of each frame is predicted using a point cloud segmentation algorithm, and then only the points belonging to the robot's task target are retained to obtain the temporal observation point cloud of the current frame.

[0013] Specifically, S3 is:

[0014] S31: Employ a cross-attention mechanism to analyze the temporal observation point cloud P in the current frame. t And the point cloud P observed in the previous frame t-1 Feature encoding is performed to obtain the temporal observation point cloud P of the current frame. t Point cloud features F t And the point cloud P observed in the previous frame t-1 Point cloud features F t-1 ;

[0015] S32: Observe the point cloud P according to the current frame time sequence. t Point cloud features F t Predict and obtain the current frame temporal observation point cloud P t offset x of each point in the middle i Based on the offset x of each point i Then observe the point cloud P in the current frame time sequence. t Perform coordinate transformation to obtain the time-series observation point cloud after coordinate transformation.

[0016] S33: Time-series observation point cloud after coordinate transformation Each point in the image and its corresponding temporal observation point cloud P in the previous frame t-1 Pairing is performed to obtain dense point cloud pairs of the current two adjacent temporal observation point clouds.

[0017] In step S32, the current frame temporal observation point cloud P is... t Point cloud features F t After being fed into a multilayer perceptron, the temporal observation point cloud P of the current frame is predicted.t offset x of each point in the middle i In the multilayer perceptron, the offset x is used to supervise the operation. i Predicted loss function The formula is as follows:

[0018]

[0019]

[0020] Where N represents the number of points in the temporal observation point cloud of the current frame, and ||2 represents the L2 norm. Indicates offset x i Ideal truth value, This represents the temporal observation point cloud P of the previous frame. t-1 With the current frame time-series observation point cloud P t Ideal relative pose transformation between them. t p i P represents the temporal observation point cloud of the current frame. t Point i in the middle.

[0021] In S33, the time-series observation point cloud after coordinate transformation Each point in The point cloud P observed in the previous frame t-1 Find the point closest to it t-1 p j From point t p i and points t-1 p j Form a point cloud pair.

[0022] In step S4, for each point in the current frame temporal observation point cloud, the points in the current frame temporal observation point cloud are compared based on the dense point cloud pair. t p i The features of the point cloud in the previous frame are related to the temporal observation points. t-1 p j The features are concatenated to obtain the fusion point feature φf. i The set of all fused point features is taken as the fused point cloud feature φF.

[0023] In step S4, several relative transformation poses are predicted by regression from the fused point cloud features of the current frame, and the confidence of each relative transformation pose is calculated. The relative transformation pose with the highest confidence is taken as the relative transformation pose between the two adjacent frames.

[0024] In S5, the 6D pose M of the target in the k-th frame k The 6D pose M of the target in the k-th frame includes three translational components and three rotational components. kThe calculation formula is as follows:

[0025] M k =M k-1 ·ΔM

[0026] Among them, M k-1 ΔM represents the 6D pose of the target in frame k-1, and ΔM represents the relative pose transformation between frame k and frame k-1.

[0027] The beneficial effects of this invention are as follows:

[0028] (1) This invention realizes continuous pose tracking of robot operation targets from point cloud video sequences. By utilizing the rich geometric information contained in the three-dimensional point cloud data, it solves the problem of poor tracking accuracy of existing two-dimensional image vision methods.

[0029] (2) This invention establishes a dense correspondence between time-series observation point clouds and performs point-by-point feature fusion on the time-series observation point clouds based on this relationship. This solves the problem of difficulty in time-series association information interaction caused by the irregularity of three-dimensional point cloud data, thereby improving the accuracy and stability of continuous tracking of robot operation target pose in dynamic scenes.

[0030] (3) The pose tracking method provided by the present invention can be directly deployed to various downstream robot applications, effectively improving the flexibility and intelligence of the robot system in the face of actual tasks.

[0031] In summary, the robot target pose tracking method provided by this invention performs continuous pose inference in three-dimensional point cloud space. In view of the irregularity and disorder of three-dimensional point cloud data, it makes full use of the temporal correlation information between observed point cloud data and performs effective feature fusion, thereby achieving accurate and stable 6D pose tracking, which has good engineering practical value. Attached Figure Description

[0032] Figure 1 This is a flowchart of the robot operation target 6D pose tracking based on temporal point cloud fusion in this invention.

[0033] Figure 2 This is a robot grasping a target object used for method testing in an embodiment of the present invention.

[0034] Figure 3 This is a schematic diagram of a point cloud video sequence in the YCBInEOAT dataset of this invention.

[0035] Figure 4 This is a schematic diagram illustrating the process of establishing dense correspondences between time-series observation point clouds in an embodiment of the present invention.

[0036] Figure 5This is a graph showing the test results of pose tracking stability in an embodiment of the present invention.

[0037] Figure 6 This is a schematic diagram illustrating the actual tracking effect in an embodiment of the present invention. Detailed Implementation

[0038] The invention will be further explained below based on the public benchmark dataset (YCBInEOAT) used to test target pose tracking performance:

[0039] In this embodiment, the robot target 6D pose tracking method based on temporal point cloud fusion provided by the present invention is tested on the YCBInEOAT dataset. The YCBInEOAT dataset contains multiple RGB-D videos recording the robot performing grasping tasks, covering five different categories of target objects (see...). Figure 2 The robot used was the Yaskawa Motoman dual-arm robot, and the RGB-D camera was an Azure Kinect camera. In each video segment, the target object was first placed on the worktable, then the robot moved to the vicinity of the target object and grasped it. After being grasped by the robot, the target object moved spatially with the robot. The goal of the tracking task was to continuously infer the 6D pose of the target object in the observation coordinate system (i.e., the camera coordinate system) in each frame of observation data. Each frame of the video sequence contains a raw RGB image, a depth map, and the target object's mask, category, pose, and other ground truth label information in the current frame.

[0040] The robot target 6D pose tracking process in this invention is as follows: Figure 1 As shown. For this dataset, the specific steps of the pose tracking method provided by this invention are as follows:

[0041] S1: Continuously acquire and obtain the original point cloud video sequence of the robot's operation scene through an RGB-D camera, where each frame of point cloud is converted from the depth map acquired by the RGB-D camera;

[0042] In this embodiment, the conversion of the depth map into a 3D point cloud requires the support of camera intrinsic parameters. The specific camera intrinsic parameters are: [320.21,244.34,319.58,417.11].

[0043] In this embodiment, the point cloud video sequence collected from the YCBInEOAT dataset is as follows: Figure 3 As shown.

[0044] S2: After segmenting the original point cloud video sequence using a point cloud segmentation algorithm, we obtain an observation point cloud video sequence containing only the robot's work target, and determine the initial frame temporal observation point cloud.

[0045] Specifically, for each frame of the original point cloud in the original point cloud video sequence, the semantic category of all points in the original point cloud of each frame is predicted using a point cloud segmentation algorithm, and then background points are removed, that is, only the points belonging to the robot's task target are retained, to obtain the temporal observation point cloud of the current frame.

[0046] In this embodiment, the KPConv algorithm is used for point cloud segmentation.

[0047] S3: Extract the features corresponding to each point cloud in the temporal observation point cloud of the current frame and its previous frame, and use the reverse prediction method to construct a dense point cloud pair of the temporal observation point clouds of the current two adjacent frames.

[0048] like Figure 4 As shown, S3 specifically refers to:

[0049] S31: Employ a cross-attention mechanism to analyze the temporal observation point cloud P in the current frame. t And the point cloud P observed in the previous frame t-1 Feature encoding is performed to obtain the temporal observation point cloud P of the current frame. t Point cloud features F t And the point cloud P observed in the previous frame t-1 Point cloud features F t-1 ;

[0050] S32: Observe the point cloud P according to the current frame time sequence. t Point cloud features F t Predict and obtain the current frame temporal observation point cloud P t offset x of each point in the middle i Based on the offset x of each point i Then observe the point cloud P in the current frame time sequence. t Perform coordinate transformation to obtain the time-series observation point cloud after coordinate transformation. For the current frame time-series observation point cloud P t any point in t p i ∈P t In other words, the transformed point Represented as:

[0051]

[0052] Current frame temporal observation point cloud P t offset x of each point in the middle i It is by observing the point cloud P in the current frame time sequence t Point cloud features F t The prediction obtained after being fed into a multilayer perceptron (MLP) network, where the multilayer perceptron is used to supervise the offset x. i Predicted loss function The formula is as follows:

[0053]

[0054]

[0055] Where N represents the number of points in the temporal observation point cloud of the current frame, and ||2 represents the L2 norm. Indicates offset x i Ideal truth value, This represents the temporal observation point cloud P of the previous frame. t-1 With the current frame time-series observation point cloud P t Ideal relative pose transformation between them. t p i P represents the temporal observation point cloud of the current frame. t Point i in the middle.

[0056] S33: Time-series observation point cloud after coordinate transformation Each point in the image and its corresponding temporal observation point cloud P in the previous frame t-1 Pairing is performed to obtain dense point cloud pairs of the current two adjacent temporal observation point clouds.

[0057] Among them, for the time-series observation point cloud after coordinate transformation Each point in The point cloud P observed in the previous frame t-1 Find the point closest to it t-1 p j From point t p i and points t-1 p j Form a point cloud pair.

[0058] S4: Based on dense point cloud pairs, the features corresponding to each point cloud in the temporal observation point cloud of the current frame and its previous frame are fused point by point to obtain fused point cloud features, and then the relative transformation pose between the two adjacent frames is predicted.

[0059] In S4, for each point in the current frame temporal observation point cloud, the points in the current frame temporal observation point cloud are compared based on the dense point cloud pair. t p i The features of the point cloud in the previous frame are related to the temporal observation points. t-1 p j The features are concatenated to obtain the fusion point feature φf. i The set of all fused point features is taken as the fused point cloud feature φF.

[0060] In S4, several relative transformation poses are predicted from the fused point cloud features of the current frame through supervised training, and the confidence of each relative transformation pose is calculated. The relative transformation pose with the highest confidence is taken as the relative transformation pose between the two adjacent frames.

[0061] S5: Repeat S3-S4 to calculate the relative transformation pose between the remaining two adjacent temporal observation point clouds in the video sequence containing only the robot's work target. Combine this with the initial pose of the work target to achieve 6D pose tracking of the work target.

[0062] Among them, the 6D pose M of the target in the k-th frame k The 6D pose M of the target in the k-th frame includes three translational components and three rotational components. k The calculation formula is as follows:

[0063] M k =M k-1 ·ΔM

[0064] Among them, M k-1 Let M represent the 6D pose of the target in frame k-1, and ΔM represent the relative pose transformation between frame k and frame k-1, which can be obtained through steps S3 to S4. Given the initial pose M0 of the target, iterative calculation based on the above formula can obtain the 6D pose of the target in each frame of the observed point cloud video sequence, thereby realizing 6D (6 degrees of freedom) pose tracking of the target.

[0065] In this embodiment, the ideal labeled pose provided by the dataset is used to initialize the target pose of the operation in the starting frame. It should be noted that the initial pose can also be estimated by manual calibration or various pose recognition methods.

[0066] This embodiment uses the NVIDIA RTX 3090 graphics processor and the PyTorch framework for model training and testing. The test results are as follows.

[0067] The quantitative results of pose tracking accuracy tests are shown in Table 1. The AUC (Average Model Distance-S) metric, based on the average model distance (ADD-S), was used to evaluate the accuracy of pose tracking, with the threshold for the AUC curve set to 0.1m. It can be seen that the average pose tracking accuracy of the method described in this invention reaches over 90%.

[0068] Table 1 shows the accuracy test results (%) of the pose tracking method of the present invention.

[0069]

[0070] Quantitative pose tracking stability test results are as follows Figure 5 As shown. Figure 5In this method, based on a given video sequence segment, the prediction error of relative pose between adjacent frames is used as an evaluation index for tracking stability. A pose deviation of 1 cm / 3° is used as a threshold, and the proportion δ exceeding this threshold is recorded every 50 frames. To further verify the superiority of the method, a classic representative method (DenseFusion) is selected for comparison. This comparison method does not utilize temporal information but instead uses a DenseFusion model to perform independent and repeated pose estimation in each frame observation. It can be seen that the method of this invention exhibits better stability in pose tracking, with the proportion δ of relative pose prediction error exceeding the threshold being less than 20%.

[0071] In addition, some qualitative tracking results, such as Figure 6 As shown, Figure 6 a, Figure 6 b, Figure 6 c Figure 6 The values ​​of 'd' represent the pose tracking results at different times during the robot's object grasping process. The point cloud model of the object is reprojected onto the image plane with the tracked pose, and the corresponding bounding box is drawn. It can be seen that the method of this invention can successfully track the pose of the robot's target, further verifying the effectiveness of the method.

[0072] This invention is not limited to the above-described embodiments, but is only used to help understand the method and core idea of ​​this invention. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from the principles of this invention, and these improvements and modifications also fall within the protection scope of the claims of this application. Contents not described in detail in this specification are prior art known to those skilled in the art.

Claims

1. A method for 6D pose tracking of a robot task target based on temporal point cloud fusion, characterized in that, Includes the following steps: S1: Continuously acquire and obtain the original point cloud video sequence of the robot's operation scene through an RGB-D camera; S2: After performing point cloud segmentation on the original point cloud video sequence, an observation point cloud video sequence containing only the robot's work target is obtained, and the initial frame temporal observation point cloud is determined. S3: Extract the features corresponding to each point cloud in the temporal observation point cloud of the current frame and its previous frame, and use the reverse prediction method to construct a dense point cloud pair of the temporal observation point clouds of the current two adjacent frames. Specifically, S3 is: S31: Employ a cross-attention mechanism to analyze the temporal observation point cloud P in the current frame. t And the point cloud P observed in the previous frame t-1 Feature encoding is performed to obtain the temporal observation point cloud P of the current frame. t Point cloud features F t And the point cloud P observed in the previous frame t-1 Point cloud features F t-1 ; S32: Observe the point cloud P according to the current frame time sequence. t Point cloud features F t Predict and obtain the current frame temporal observation point cloud P t offset of each point in the middle Based on the offset of each point Then observe the point cloud P in the current frame time sequence. t Perform coordinate transformation to obtain the time-series observation point cloud after coordinate transformation. ; S33: Time-series observation point cloud after coordinate transformation Each point in the image and its corresponding temporal observation point cloud P in the previous frame t-1 Pairing is performed to obtain dense point cloud pairs of the current two adjacent temporal observation point clouds; S4: Based on dense point cloud pairs, the features corresponding to each point cloud in the temporal observation point cloud of the current frame and its previous frame are fused point by point to obtain fused point cloud features, and then the relative transformation pose between the two adjacent frames is predicted. S5: Repeat S3-S4 to calculate the relative transformation pose between the remaining two adjacent temporal observation point clouds in the video sequence containing only the robot's work target, thereby achieving 6D pose tracking of the work target.

2. The method for 6D pose tracking of a robot task target based on temporal point cloud fusion according to claim 1, characterized in that, In step S2, for each frame of the original point cloud in the original point cloud video sequence, the semantic category of all points in the original point cloud of each frame is predicted using a point cloud segmentation algorithm, and then only the points belonging to the robot's task target are retained to obtain the temporal observation point cloud of the current frame.

3. The method for 6D pose tracking of a robot task target based on temporal point cloud fusion according to claim 1, characterized in that, In step S32, the current frame temporal observation point cloud P is... t Point cloud features F t After being fed into a multilayer perceptron, the temporal observation point cloud P of the current frame is predicted. t offset of each point in the middle Among them, the multilayer perceptron is used to supervise the offset. Predicted loss function The formula is as follows: Where N represents the number of points in the temporal observation point cloud of the current frame. Represents the L2 norm, Indicates offset Ideal truth value, This represents the temporal observation point cloud P of the previous frame. t-1 With the current frame time-series observation point cloud P t Ideal relative pose transformation between them. P represents the temporal observation point cloud of the current frame. t Point i in the middle.

4. The method for 6D pose tracking of a robot task target based on temporal point cloud fusion according to claim 1, characterized in that, In S33, the time-series observation point cloud after coordinate transformation Each point in In the previous frame, observe the point cloud P. t-1 Find the point closest to it From point and points Form a point cloud pair.

5. The method for 6D pose tracking of a robot task target based on temporal point cloud fusion according to claim 1, characterized in that, In step S4, for each point in the current frame temporal observation point cloud, the points in the current frame temporal observation point cloud are compared based on the dense point cloud pair. The features of the point cloud in the previous frame are related to the temporal observation points. The features are spliced ​​together to obtain the fusion point features. The set of all fused point features is taken as the fused point cloud feature. .

6. The method for 6D pose tracking of a robot task target based on temporal point cloud fusion according to claim 1, characterized in that, In step S4, several relative transformation poses are predicted by regression from the fused point cloud features of the current frame, and the confidence of each relative transformation pose is calculated. The relative transformation pose with the highest confidence is taken as the relative transformation pose between the two adjacent frames.

7. The method for 6D pose tracking of a robot task target based on temporal point cloud fusion according to claim 1, characterized in that, In S5, the 6D pose of the target in the k-th frame The 6D pose of the target in the k-th frame includes three translational components and three rotational components. The calculation formula is as follows: in, This represents the 6D pose of the target in the (k-1)th frame. This represents the relative pose transformation between the k-th frame and the (k-1)-th frame.