A robot joint layer perception system fusing ToF and monocular vision

A ToF depth camera and a monocular RGB camera are installed at the robot's end effector. The depth map is optimized by combining bilateral filtering with Kalman filtering, and the SURF algorithm and ICP point cloud registration are used for target recognition and pose estimation. This addresses the fixed perception viewpoint of robot vision systems and the difficulty of balancing depth accuracy against image detail in precision assembly and flexible grasping tasks, enabling high-precision target detection and grasping control.

CN120503262B · Active · Publication Date: 2026-06-26 · WUXI SMART POWER ROBOT CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: WUXI SMART POWER ROBOT CO LTD
Filing Date: 2025-06-11
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing robot vision systems have a fixed perception viewpoint, are easily occluded by the robotic arm, rely on high-precision extrinsic parameter calibration, and perceive from a long distance, which makes it difficult to achieve both depth accuracy and image detail. In precision assembly and flexible grasping tasks, this results in low success rates.

Method used

A ToF depth camera and a monocular RGB camera are mounted using an eye-in-hand structure. The depth map is optimized by combining bilateral filtering and Kalman filtering. Target recognition and pose estimation are performed using the SURF algorithm and ICP point cloud registration. Path planning and grasping control are achieved by combining improved RRT path planning with hierarchical collision detection.

Benefits of technology

It improves target detection accuracy, pose estimation robustness, and trajectory planning efficiency in dynamic and complex scenarios. The system positioning accuracy reaches ±3 mm, and the grasping success rate in dynamic scenes is high.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a robot joint layer perception system fusing ToF and monocular vision, belonging to the field of robot perception. The system adopts an Eye-in-Hand architecture that integrates a ToF camera and an RGB camera. Depth map quality is improved through joint bilateral filtering and Kalman filtering. Six-degree-of-freedom pose estimation of the target is achieved based on SURF feature matching and KD-Tree-accelerated ICP registration. Grasping paths are planned by combining an improved RRT algorithm with hierarchical collision detection. By spatially aligning and fusing the two kinds of visual information, the system compensates for the limitations of each sensor and improves the robustness and precision of the overall perception system. In single-arm and dual-arm grasping experiments, the system positioning accuracy reaches ±3 mm and the grasping success rate in dynamic scenes is high. The system is suitable for industrial automation, logistics sorting, and similar applications.

Description

Technical Field

[0001] This invention belongs to the field of robot perception technology and relates to a robot joint layer perception system that integrates ToF (Time of Flight) and monocular vision, which is suitable for industrial grasping, precision assembly and other scenarios.

Background Technology

[0002] With the rapid development of industrial automation and intelligent manufacturing, the application of robot systems in tasks such as grasping, handling, and assembly is showing a diversified development trend. By integrating multimodal sensors such as vision, touch, and force sensors and fusing machine learning algorithms, modern robots have initially acquired the ability to operate autonomously in complex environments. Among them, visual perception, as the core supporting technology for achieving autonomous operation, directly affects the task execution efficiency in dynamic and complex scenarios due to its spatiotemporal accuracy and real-time performance. Especially in industrial scenarios such as precision assembly and flexible grasping, high-precision environmental perception within a near-field region of less than 50cm for end effectors has become a key bottleneck restricting the success rate of operations.

[0003] Traditional robot vision systems mostly adopt an eye-to-hand architecture, which fixes the sensors to the periphery of the robot's workspace. This architecture has the following shortcomings: First, the perception field of view is fixed and easily obstructed by the robotic arm, resulting in blind spots; second, the system relies on high-precision extrinsic parameter calibration, leading to high deployment and maintenance costs; third, the perception distance is long, making it difficult to simultaneously achieve depth accuracy and image detail, thus limiting its application in fine grasping tasks.

[0004] Current research mainly follows three technical routes. (1) Tactile perception enhancement schemes, such as the GTac biomimetic tactile sensor developed by Lu's team (Lu Z, Gao X, Yu H. GTac: A biomimetic tactile sensor with skin-like heterogeneous force feedback for robots[J]. IEEE Sensors Journal, 2022, 22(14): 14491-14500.), which achieves 0.1 N force resolution through skin-like contact feedback but has a response delay of about 20 ms and depends on physical contact triggering. (2) Global vision guidance schemes. Nguyen et al. (Nguyen K, Dang T, Huber M. Real-time 3d semantic scene perception for egocentric robots with binocular vision[J]. arXiv preprint arXiv:2402.11872, 2024.) used a binocular vision system based on the D435i camera to achieve object segmentation and grasping on the Baxter robot, but the end-positioning error reached 7.3 mm due to occlusion by the robot body. S. Jain et al. (Jain S, Argall B. Grasp detection for assistive robotic manipulation[C]//2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016: 2015-2021.) proposed a grasp detection algorithm for assistive robotic manipulation in which the RGB-D sensor is located outside the robotic arm. This method can detect human-like grasps for a variety of previously unseen household objects and has been successfully applied to autonomous grasping with the MICO robotic arm. However, in long-distance perception mode, the depth resolution of the RGB-D camera drops significantly. (3) End-embedded perception schemes, such as the Eye-in-Hand system of Zeng's team (Zeng A, Song S, Yu KT, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching[J]. The International Journal of Robotics Research, 2022, 41(7): 690-705.), which uses an end-effector RGB-D camera and improves positioning accuracy to 2.1 mm but suffers from a multimodal data fusion latency of 120 ms. Experimental studies show that a single sensor has a depth loss rate as high as 32.7% in low-texture areas, and the pose estimation error in dynamic scenes fluctuates within ±5°.

Summary of the Invention

[0005] The purpose of this invention is to provide a robot joint-level perception system that integrates Time-of-Flight (ToF) and monocular vision, overcoming the shortcomings of existing technologies and improving target detection accuracy, pose estimation robustness, and trajectory planning efficiency in dynamic and complex scenarios. A vision sensor is mounted on the robot's end effector (Eye-in-Hand structure) to achieve real-time environmental perception close to the operating area. The ToF depth camera provides high-frame-rate, interference-resistant 3D depth information suitable for close-range mapping, while the monocular RGB camera provides rich texture information that benefits target recognition and feature extraction. By spatially aligning and fusing the two types of visual information, the limitations of each are compensated to some extent, improving the robustness and accuracy of the overall perception system. The system exploits the complementary advantages of the two sensors and combines joint bilateral filtering with Kalman filtering to optimize the ToF depth maps. It proposes an ICP (Iterative Closest Point) registration method based on SURF (Speeded-Up Robust Features) and accelerated by a KD-Tree (k-dimensional tree) to achieve high-precision six-degree-of-freedom pose estimation of the target. It also builds a complete integrated perception-recognition-estimation-grasping pipeline on the ROS platform, supporting path planning and control execution for single-arm and dual-arm collaborative tasks. The objectives of this invention are achieved through the following specific technical solutions.

[0006] A robot joint-level perception system integrating Time-of-Flight (ToF) and monocular vision, comprising:

[0007] Perception module: integrates a ToF depth camera and a monocular RGB camera, and is installed at the end of the robot in an eye-in-hand structure;

[0008] Depth map optimization processing module: Combines bilateral filtering and Kalman filtering for two-stage optimization. Using the RGB image as the guide image, it performs edge-preserving filtering on the ToF depth map to suppress noise while preserving object contour information;

[0009] Target recognition and pose estimation module: Target recognition uses the SURF algorithm to extract local feature points from RGB images and FLANN (Fast Library for Approximate Nearest Neighbors) to accelerate matching against the target template image. After erroneous matches are filtered out using the RANSAC (Random Sample Consensus) algorithm, the image homography matrix is calculated to obtain the target region of interest (ROI). Pose estimation uses KD-Tree-accelerated ICP point cloud registration to achieve six-degree-of-freedom pose estimation.

[0010] Path planning and grasping control module: Based on the ROS platform, path planning and grasping control are achieved through an improved RRT (Rapidly-exploring Random Tree) path planning algorithm, a hierarchical collision detection mechanism, a single-arm grasping control process, and a dual-arm collaborative control strategy.

[0011] Furthermore, the sensing module uses Zhang Zhengyou's chessboard calibration method to calibrate the intrinsic parameters of the monocular RGB camera.

[0012] Furthermore, the specific process of two-stage optimization using combined bilateral filtering and Kalman filtering in the depth map optimization processing module is as follows:

[0013] Joint bilateral filtering: The RGB image guides the ToF depth map filtering, with the following weighting function:

[0014] Where p is the target pixel, I(p) is the final filtered depth value of the target pixel, q is a neighboring pixel in the original target depth map, $G_s(\lVert p - q \rVert)$ is the spatial distance weight, and $G_r(\lvert G(q) - G(p) \rvert)$ is the color similarity weight function;

[0015] Kalman filtering:

[0016] (1) Establish state equations for depth values

[0017] $x_k = A x_{k-1} + w_k$

[0018] $z_k = H x_k + v_k$

[0019] Where: $x_k$ and $x_{k-1}$ are the depth values for the k-th and (k-1)-th frames, respectively, in mm;

[0020] A: State transition matrix, which describes how depth values change between frames;

[0021] $w_k$, $v_k$: Process noise and measurement noise, each following a zero-mean Gaussian distribution, with variances Q and R respectively;

[0022] $z_k$: The measured depth value of the k-th frame, in mm;

[0023] H: Observation matrix, usually the identity matrix;

[0024] (2) Prediction and updating: each frame of data is processed recursively. In the prediction stage, the error covariance matrix $P_k^-$ is:

[0025] $P_k^- = A P_{k-1} A^T + Q$

[0026] Where: $P_k^-$ is the prediction error covariance matrix of the k-th frame; $P_{k-1}$ is the error covariance matrix of the (k-1)-th frame;

[0027] In the update stage, the Kalman gain $K_k$ is:

[0028] $K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}$

[0029] Where R is the error covariance matrix of the observation noise.

[0030] Furthermore, the improved RRT path planning algorithm includes:

[0031] (1) Target point biased sampling: When expanding the tree structure, the system directly selects the target pose as the sampling point with a certain probability to enhance the guidance of the search tree and effectively improve the path convergence speed and search efficiency.

[0032] (2) Dynamic optimization of parent node and path pruning: After a new node is generated, a local search region is constructed with the node as the center, and its parent node is dynamically updated as the local optimal connection point; at the same time, the trajectory reconnection algorithm is used to remove redundant nodes, so that the generated path is shorter and smoother, reducing the shaking or energy consumption of the robotic arm during the execution process.

[0033] Furthermore, the layered collision detection mechanism includes:

[0034] (1) Construct the geometric model of the robotic arm based on the minimum cylinder envelope method;

[0035] (2) Decoupling detection of master and slave arms in dual-arm task: First, plan the path of the master arm and record the temporal pose. Then, use the movement of the master arm as a dynamic obstacle to plan the path of the slave arm.

[0036] Furthermore, the single-arm grasping control process includes:

[0037] (1) Pre-grasp pose generation: A set of safety buffer poses is placed in front of the target position to ensure that the grasping direction is aligned with the principal axis of the object;

[0038] (2) Synchronous control mechanism: The gripper begins closing as soon as the end effector reaches the preset position, and the system adaptively adjusts the closing speed and force according to gripper status feedback;

[0039] (3) Grasp exit path planning: After a successful grip is confirmed, the shortest obstacle-avoiding path is replanned based on the current environment to place the object in the designated area.

[0040] Furthermore, the dual-arm cooperative control strategy includes:

[0041] (1) Fix the position of the slave arm, treat it as an obstacle, and plan the path of the master arm;

[0042] (2) Based on the inverse kinematics solution of the master arm trajectory, the synchronous motion of the slave arm is achieved by coordinating the relative pose constraints between the ends of the two arms and the object;

[0043] (3) Perform time synchronization interpolation to ensure the dynamic consistency of the two arms throughout the mission.
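Steps (2) and (3) can be illustrated with a minimal sketch. It assumes that the relative pose between the two end effectors is a constant 4x4 homogeneous transform while both arms hold the same object, and that time synchronization is done by linear resampling onto a shared normalized time grid. The function names and the choice of linear interpolation are illustrative assumptions, not details given in the patent.

```python
import numpy as np

def slave_pose(T_master, T_master_to_slave):
    """Slave end-effector pose implied by the master pose.

    Assumes a fixed relative transform between the two end effectors
    (both grippers rigidly holding one object).
    """
    return T_master @ T_master_to_slave

def synchronize(traj_a, traj_b, n_steps):
    """Resample two joint-space trajectories onto one shared time grid.

    Each trajectory is an array of waypoints (rows) by joints (columns).
    Linear interpolation is an assumption made for this sketch.
    """
    s = np.linspace(0.0, 1.0, n_steps)  # shared normalized time

    def resample(traj):
        traj = np.asarray(traj, dtype=float)
        s_src = np.linspace(0.0, 1.0, len(traj))
        cols = [np.interp(s, s_src, traj[:, j]) for j in range(traj.shape[1])]
        return np.stack(cols, axis=1)

    return resample(traj_a), resample(traj_b)
```

After resampling, both arms have the same number of waypoints, so waypoint i of each arm can be sent to the controllers at the same time stamp.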

[0044] The present invention has the following beneficial technical effects: The robot joint layer perception system provided by the present invention integrates ToF and monocular vision. By spatially aligning and fusing the two types of visual information, it compensates for the limitations of each and improves the robustness and accuracy of the overall perception system. In single-arm and dual-arm grasping experiments, the system positioning accuracy reaches ±3 mm and the grasping success rate in dynamic scenes is high.

Attached Figure Description

[0045] Figure 1 is a diagram showing the relationship between the different coordinate systems.

[0046] Figure 2 is a feature description diagram of the SURF algorithm.

[0047] Figure 3 is a flowchart of the collision detection strategy.

[0048] Figure 4 is a simulation diagram of a robot grasping experimental platform.

[0049] Figure 5 is a diagram of the positioning and grasping process in Example 1.

[0050] Figure 6 is a diagram of the positioning and grasping process in Example 2.

[0051] Figure 7 is the RGB image of the book in Example 3.

[0052] Figure 8 is a ToF depth image of the book in Example 3.

[0053] Figure 9 is a three-dimensional point cloud diagram of the book in Example 3.

[0054] Figure 10 is the ROI point cloud map of the book in Example 3.

[0055] Figure 11 is a diagram of the book positioning and retrieval process in Example 3.

Detailed Implementation

[0056] The technical solution of the present invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the protection scope of the present invention.

[0057] In the description of this invention, it should be understood that the terms "center," "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance, quantity, or position.

[0058] In the description of this invention, it should be noted that, unless otherwise explicitly specified and limited, the terms "installation," "connection," and "linking" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.

[0059] A robot joint-layer perception system integrating Time-of-Flight (ToF) and monocular vision includes: a perception module integrating a ToF depth camera and a monocular RGB camera, mounted on the robot's end effector in an eye-in-hand structure; a depth map optimization processing module performing two-stage optimization using bilateral filtering and Kalman filtering, using the RGB image as a guide map to perform edge-preserving filtering on the ToF depth map, suppressing noise while retaining object contour information; a target recognition and pose estimation module extracting local feature points from the RGB image using the SURF algorithm, establishing a matching relationship with the target template image using FLANN accelerated matching, filtering out erroneous matching points using the RANSAC algorithm, and calculating the image homography matrix to obtain the target region ROI; pose estimation using KD-Tree accelerated ICP point cloud registration to achieve six-degree-of-freedom pose estimation; and a path planning and grasping control module based on the ROS platform, implementing path planning and grasping control through an improved RRT path planning algorithm, a layered collision detection mechanism, a single-arm grasping control process, and a dual-arm cooperative control strategy. Details are as follows.

[0060] (1) Perception Module

[0061] A ToF depth camera and an industrial-grade monocular RGB camera are mounted above the end effector gripper. The two cameras are fixed on the same bracket with overlapping fields of view, forming a multimodal sensing base. The RGB camera's intrinsic parameters are calibrated using Zhang Zhengyou's checkerboard calibration method to obtain parameters such as focal length, principal point, and distortion coefficient.

[0062] The RGB vision system mainly involves the transformation relationships between four spatial coordinate systems, as shown in Figure 1. The pixel coordinate system (u-v) takes the top-left corner of the acquired image as its origin; the image physical coordinate system (x-y) takes the image center as its origin; the camera coordinate system ($X_c$-$Y_c$-$Z_c$) takes the optical center as its origin, with the camera's viewing direction as the $+Z_c$ direction; and the world coordinate system is $X_w$-$Y_w$-$Z_w$.

[0063] The Tsai-Lenz method (Tsai R Y, Lenz R K. A new technique for fully autonomous and efficient 3D robotics hand/eye calibration[J]. IEEE Transactions on Robotics and Automation, 1989, 5(3): 345-358.), the Park method (Park F C, Martin B J. Robot sensor calibration: solving AX=XB on the Euclidean group[J]. IEEE Transactions on Robotics and Automation, 1994, 10(5): 717-721.), the Horaud method (Horaud R, Dornaika F. Hand-eye calibration[J]. The International Journal of Robotics Research, 1995, 14(3): 195-210.), and the Daniilidis method (Daniilidis K. Hand-eye calibration using dual quaternions[J]. The International Journal of Robotics Research, 1999, 18(3): 286-298.) are used to compute the transformation matrix between the end effector of the robotic arm and the camera. This realizes the transformation from the image coordinate system to the robot base coordinate system and provides a geometric basis for subsequent point cloud projection and grasping planning.
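Once the hand-eye transform is known, mapping a camera-frame point into the robot base frame is a chain of homogeneous transforms. The sketch below assumes 4x4 transforms named T_base_ee (from forward kinematics) and T_ee_cam (from hand-eye calibration); both names are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a rotation matrix and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def camera_point_to_base(p_cam, T_base_ee, T_ee_cam):
    """Map a 3D point from the camera frame to the robot base frame.

    T_base_ee: end-effector pose in the base frame (forward kinematics).
    T_ee_cam:  camera pose in the end-effector frame (hand-eye calibration).
    """
    p_h = np.append(np.asarray(p_cam, dtype=float), 1.0)  # homogeneous coordinates
    return (T_base_ee @ T_ee_cam @ p_h)[:3]
```

In an eye-in-hand setup T_ee_cam stays constant after calibration, while T_base_ee changes with every robot motion and has to be read at the moment the image is captured.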

[0064] (2) Depth Map Optimization Processing Module

[0065] Because Time-of-Flight (ToF) depth maps are susceptible to ambient light interference, edge transitions, and surface reflectivity, the raw data contains significant noise and jitter. To improve the quality of depth information, this invention proposes a two-stage optimization strategy combining bilateral filtering and Kalman filtering. Using an RGB image as a guide image, edge-preserving filtering is applied to the ToF depth map to suppress noise while retaining object contour information.

[0066] The core idea of joint bilateral filtering (Wang Decheng, Chen Xiangning, Yi Hui, et al. Hole filling and optimization algorithm for depth images based on adaptive joint bilateral filtering [J]. Chinese Journal of Lasers, 2019, 46(10): 1009002.) is to perform a weighted average of each pixel value in the target depth map, where the weights are determined not only by the spatial distance between pixels but also by the similarity of pixel values in the RGB image. For the target pixel p, the filtering result I(p) is a weighted sum of pixels in the neighborhood φ, as shown in the following formula:

[0067]

[0068] Where p is the target pixel, I(p) is the final filtered depth value of the target pixel, q is a neighboring pixel in the original target depth map, $G_s(\lVert p - q \rVert)$ is the spatial distance weight, and $G_r(\lvert G(q) - G(p) \rvert)$ is the color similarity weight function;
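A minimal sketch of joint bilateral filtering follows. It assumes the commonly used normalized form, in which the filtered depth at p is the sum over neighbors q of G_s * G_r * depth(q), divided by the sum of the weights, with Gaussian kernels for both G_s and G_r and a grayscale guide image already aligned with the depth map. The kernel shapes and the parameters radius, sigma_s, and sigma_r are assumptions for illustration and are not specified in the patent.

```python
import numpy as np

def joint_bilateral_filter(depth, guide, radius=2, sigma_s=2.0, sigma_r=10.0):
    """Edge-preserving depth filter guided by a grayscale image of the same shape."""
    depth = np.asarray(depth, dtype=float)
    guide = np.asarray(guide, dtype=float)
    h, w = depth.shape
    out = np.zeros_like(depth)
    for y in range(h):
        for x in range(w):
            # Neighborhood window, clipped at the image border.
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # Spatial weight G_s(||p - q||): assumed Gaussian in pixel distance.
            g_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
            # Range weight G_r(|G(q) - G(p)|): assumed Gaussian in guide intensity.
            diff = guide[y0:y1, x0:x1] - guide[y, x]
            g_r = np.exp(-(diff ** 2) / (2 * sigma_r ** 2))
            weights = g_s * g_r
            out[y, x] = np.sum(weights * depth[y0:y1, x0:x1]) / np.sum(weights)
    return out
```

Because the range weight comes from the RGB guide rather than from the noisy depth map itself, depth values are averaged only across pixels that look alike in the color image, which is what keeps object contours sharp.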

[0069] To address the issue of inter-frame transitions in depth maps, a one-dimensional Kalman filter model is introduced. A prediction-update process is established for the depth value of each pixel to achieve inter-frame smoothing and improve stability and depth continuity.

[0070] Assume that the depth value $x_k$ evolves according to the following state transition and observation equations:

[0071] $x_k = A x_{k-1} + w_k$

[0072] $z_k = H x_k + v_k$

[0073] Where: $x_k$, $x_{k-1}$: Depth values of frames k and k-1, in millimeters (mm);

[0074] A: State transition matrix, which describes how depth values change between frames;

[0075] $w_k$, $v_k$: Process noise and measurement noise, each following a zero-mean Gaussian distribution, with variances Q and R respectively;

[0076] $z_k$: The measured depth value of the k-th frame.

[0077] H: Observation matrix, usually the identity matrix.

[0078] In this invention, since the image is captured in a static state,

[0079]

[0080] Kalman filtering consists of two stages, prediction and update, and processes each frame of data recursively. In the prediction stage, the error covariance matrix $P_k^-$ is:

[0081] $P_k^- = A P_{k-1} A^T + Q$

[0082] Where: $P_k^-$ is the prediction error covariance matrix of the k-th frame; $P_{k-1}$ is the error covariance matrix of the (k-1)-th frame;

[0083] During the update phase, the Kalman gain $K_k$ is:

[0084] $K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}$

[0085] Where R is the error covariance matrix of the observation noise.
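The per-pixel prediction-update recursion can be sketched as below. The sketch assumes a scalar model with A = H = 1 (depth treated as constant between frames, in line with the static-capture remark), and it uses the standard Kalman state and covariance updates, which the text does not spell out. The class name and the values of q and r are illustrative assumptions.

```python
import numpy as np

class DepthKalman:
    """Independent scalar Kalman filter per pixel, assuming A = H = 1."""

    def __init__(self, first_frame, q=1.0, r=25.0):
        self.x = np.asarray(first_frame, dtype=float).copy()  # state estimate (mm)
        self.p = np.full_like(self.x, r)                      # error covariance
        self.q = q                                            # process noise variance Q
        self.r = r                                            # measurement noise variance R

    def update(self, z):
        z = np.asarray(z, dtype=float)
        p_pred = self.p + self.q              # prediction: P_k^- = P_{k-1} + Q
        k = p_pred / (p_pred + self.r)        # gain: K_k = P_k^- / (P_k^- + R)
        self.x = self.x + k * (z - self.x)    # standard state update
        self.p = (1.0 - k) * p_pred           # standard covariance update
        return self.x
```

A larger r relative to q trusts the prediction more and gives stronger temporal smoothing, at the cost of slower response when the true depth changes.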

[0086] (3) Target recognition and pose estimation module

[0087] This invention combines image features with point cloud registration to achieve target recognition and 3D pose estimation. The SURF algorithm (Bay H, Tuytelaars T, Van Gool L. Surf: Speeded up robust features[C]//Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9. Springer Berlin Heidelberg, 2006: 404-417.) is used to extract local feature points from the RGB image. FLANN-accelerated matching (Badri F, Yuniarno E M, Mardi S N S. 3D point cloud data registration based on multiview image using SIFT method for Djago temple relief reconstruction[C]//2015 4th International Conference on Instrumentation, Communications, Information Technology, and Biomedical Engineering (ICICI-BME). IEEE, 2015: 191-195.) then establishes a matching relationship with the target template image. After erroneous matches are filtered out using the RANSAC algorithm, the image homography matrix is calculated to obtain the target region ROI.
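The last two steps, RANSAC filtering and homography-based ROI extraction, can be sketched with NumPy alone. The sketch assumes that matched keypoint coordinates (template to scene) are already available from SURF and FLANN, estimates the homography with an unnormalized direct linear transform, and takes the ROI as the bounding box of the projected template corners. The function names, the inlier tolerance, and the iteration count are illustrative assumptions.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform from at least 4 point pairs (N x 2 arrays)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)          # null-space vector of the system
    return H / H[2, 2]

def project(H, pts):
    """Apply a homography to N x 2 points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return pts_h[:, :2] / pts_h[:, 2:3]

def ransac_homography(src, dst, n_iters=200, tol=3.0, seed=0):
    """Keep the largest set of matches consistent with one homography."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < tol
        if inliers.sum() > best.sum():
            best = inliers
    # Refit on all inliers of the best hypothesis.
    return fit_homography(src[best], dst[best]), best

def roi_from_template(H, width, height):
    """Bounding box (min corner, max corner) of the projected template outline."""
    corners = np.array([[0, 0], [width, 0], [width, height], [0, height]], dtype=float)
    quad = project(H, corners)
    return quad.min(axis=0), quad.max(axis=0)
```

Degenerate samples (for example, nearly collinear points) can produce invalid hypotheses and NumPy warnings here; they simply receive no inliers and are discarded. A production pipeline would typically also normalize coordinates before the DLT.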

[0088] The core idea of the SURF algorithm is detection and descriptor generation based on the Hessian matrix. The algorithm divides the entire process into three main stages:

[0089] (1) Feature point detection: SURF uses the determinant value of the Hessian matrix to detect salient points in the image. To improve computational efficiency, the algorithm uses an integral image to accelerate the convolution operation of the Hessian matrix, which greatly reduces the amount of computation.

[0090] (2) Direction assignment: By analyzing the gradient information of the neighborhood of the feature point, SURF assigns a principal direction to each feature point to maintain the rotation invariance of the feature.

[0091] (3) Feature description: SURF constructs feature descriptors based on the principal direction of each feature point and the Haar wavelet responses within its neighborhood. As shown in the feature description diagram in Figure 2, the descriptor forms a 64-dimensional or 128-dimensional feature vector by statistically summarizing changes in image brightness, giving it strong discriminative power and stability.
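The integral image mentioned in stage (1) is what makes SURF's box-filter approximation of the Hessian cheap: once the integral image is built, the sum over any axis-aligned rectangle costs four lookups regardless of its size. The sketch below shows only that building block, not a full SURF detector; the function names are illustrative.

```python
import numpy as np

def integral_image(img):
    """Integral image with a zero first row and column for simple indexing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] using four lookups in the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```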

[0092] For the original point cloud data, this invention adopts an ROI extraction method: the optimized ToF depth map is projected to generate a point cloud, the position of the target point cloud is determined from the 2D image, the valid point cloud of the target object is separated from the complex background, and the local point cloud corresponding to the ROI is extracted. To obtain complete and high-precision model point cloud data, this invention uses an RGB-D camera for data acquisition. Depending on the application requirements, single-view or multi-view point cloud acquisition can be selected to ensure that the surface details of the target object are fully captured. The predefined object model point cloud is then loaded and registered in 3D using the ICP algorithm (Yang J, Li H, Campbell D, et al. Go-ICP: A globally optimal solution to 3D ICP point-set registration[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(11): 2241-2254.). To improve efficiency and accuracy, a KD-Tree structure is introduced to accelerate the nearest neighbor search.
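Extracting the ROI point cloud amounts to back-projecting the depth pixels inside the 2D ROI through the pinhole model. The sketch below assumes the depth map is already registered to the RGB image, that depth is in millimeters with 0 marking invalid pixels, and that the ROI is an axis-aligned box (u0, v0, u1, v1). The intrinsics fx, fy, cx, cy and the function name are illustrative assumptions.

```python
import numpy as np

def roi_point_cloud(depth_mm, roi, fx, fy, cx, cy):
    """Back-project depth pixels inside an ROI to 3D camera-frame points (mm)."""
    u0, v0, u1, v1 = roi
    v, u = np.mgrid[v0:v1, u0:u1]                 # pixel coordinates inside the ROI
    z = depth_mm[v0:v1, u0:u1].astype(float)
    valid = z > 0                                 # drop pixels without a depth reading
    x = (u - cx) * z / fx                         # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=1)
```

The result is an N x 3 array that can be passed directly to the registration step as the target point cloud.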

[0093] In the field of point cloud registration, the ICP algorithm is one of the most widely used techniques. It does not rely on feature points but directly performs rigid registration based on point cloud data by minimizing the Euclidean distance error between two sets of points. The ICP algorithm is a classic iterative optimization method used to calculate the optimal rigid transformation matrix between two point clouds, namely the rotation matrix R and the translation vector T. Its core idea is:

[0094] The input is a given template point cloud $P = \{p_i\}$ and an ROI target point cloud $Q = \{q_i\}$. The output is a rigid transformation (R, T) such that the transformed point cloud $P' = RP + T$ is aligned with Q as closely as possible.

[0095] The main steps of the ICP algorithm are as follows:

[0096] (1) For each point $p_i$ in the template point cloud, find the nearest corresponding point $q_i$ in the ROI target point cloud Q:

[0097] $q_i = \arg\min_{q \in Q} \lVert p_i - q \rVert$

[0098] Where $\lVert \cdot \rVert$ represents the Euclidean distance.

[0099] (2) Calculate the optimal rotation matrix R and translation vector T to minimize the error:

[0100] $E(R, T) = \frac{1}{N} \sum_{i=1}^{N} \lVert R p_i + T - q_i \rVert^2$

[0101] Here N is the number of corresponding point pairs. This problem can be solved using singular value decomposition (SVD). The specific steps are as follows:

[0102] Calculate the means of the template point cloud and the target point cloud:

[0103] $\bar{p} = \frac{1}{N} \sum_{i=1}^{N} p_i, \quad \bar{q} = \frac{1}{N} \sum_{i=1}^{N} q_i$

[0104] Calculate the centered points:

[0105] $p_i' = p_i - \bar{p}, \quad q_i' = q_i - \bar{q}$

[0106] Calculate the covariance matrix:

[0107] $H = \sum_{i=1}^{N} p_i' \, q_i'^{T}$

[0108] Perform SVD on matrix H:

[0109] $H = U \Sigma V^T$

[0110] Calculate the rotation matrix:

[0111] $R = V U^T$

[0112] If det(R) = -1, a correction is needed:

[0113] $V' = [v_1 \; v_2 \; -v_3], \quad R = V' U^T$

[0114] Calculate the translation vector:

[0115] $T = \bar{q} - R \bar{p}$

[0116] (3) Update the point cloud P according to the calculated transformation (R, T):

[0117] $P' = RP + T$

[0118] Then update the template point cloud and proceed with the next iteration.

[0119] (4) Calculate the root mean square error (RMS error):

[0120] $E_k = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert R p_i + T - q_i \rVert^2}$

[0121] If the error is below the set threshold $e_0$ or decreases very little, the registration is considered complete; otherwise, proceed to the next iteration.

[0122] (5) Conditions for stopping iteration:

[0123] The error change is less than the threshold $e_0$:

[0124] $\lvert E_k - E_{k-1} \rvert < e_0$

[0125] The number of iterations has reached its maximum value:

[0126] $k > k_{max}$

[0127] The pose information of the target relative to the model point cloud is obtained by point cloud alignment calculation, which provides the basic input for subsequent grasping action generation.
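The registration loop described above can be sketched as follows. The sketch uses SciPy's cKDTree for the KD-Tree nearest-neighbor search, which is a library choice made here for illustration, and solves each alignment step with the SVD procedure. The function names and default thresholds are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(P, Q):
    """Least-squares R, T mapping paired points P onto Q (both N x 3) via SVD."""
    p_bar, q_bar = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_bar).T @ (Q - q_bar)           # covariance: sum of p_i' q_i'^T
    U, _, Vt = np.linalg.svd(H)
    V = Vt.T
    if np.linalg.det(V @ U.T) < 0:            # reflection case: flip the last column of V
        V[:, 2] *= -1
    R = V @ U.T
    T = q_bar - R @ p_bar
    return R, T

def icp(P, Q, max_iters=50, e0=1e-8):
    """Align template cloud P to target cloud Q; returns accumulated R, T and RMS error."""
    tree = cKDTree(Q)                         # KD-Tree built once over the target
    R_total, T_total = np.eye(3), np.zeros(3)
    P_cur = np.asarray(P, dtype=float).copy()
    prev_err = np.inf
    for _ in range(max_iters):
        _, idx = tree.query(P_cur)            # nearest target point for each template point
        R, T = best_rigid_transform(P_cur, Q[idx])
        P_cur = P_cur @ R.T + T               # update the template cloud
        R_total, T_total = R @ R_total, R @ T_total + T
        err = np.sqrt(np.mean(np.sum((P_cur - Q[idx]) ** 2, axis=1)))
        if abs(prev_err - err) < e0:          # stop when the error change is below e0
            break
        prev_err = err
    return R_total, T_total, err
```

Like any ICP variant, this converges only to a local optimum, so it depends on a reasonable initial alignment; here that role is played by restricting the target cloud to the ROI.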

[0128] (4) Path planning and grasping control module

[0129] To ensure that humanoid robots can efficiently, safely, and smoothly complete grasping operations in complex dynamic environments, this invention designs and implements a complete path planning and grasping control module based on the ROS platform. This module encompasses an improved RRT path planning algorithm, a layered collision detection mechanism, a dual-arm cooperative control strategy, and the grasping action execution flow. The relevant modules are integrated into the MoveIt motion control framework and support in-loop experimental verification on a real robot platform.

[0130] The Rapidly-Exploring Random Tree (RRT) algorithm (Ding Chengjun, Wang Zhenlin, Geng Yukun, et al. Efficient Sampling Adaptive RRT Algorithm for Mobile Robot Path Planning [J / OL]. Mechanical Science and Technology, 1-8 [2025-05-23].) is widely used in robot path planning due to its excellent high-dimensional space search capability. However, traditional RRT algorithms suffer from problems such as non-smooth paths, low expansion efficiency, and unstable obstacle avoidance performance. To address these shortcomings, this invention proposes the following two optimizations:

[0131] (1) Goal-biased Sampling

[0132] When expanding the tree structure, the system directly selects the target pose as the sampling point with a certain probability (e.g., 30%) to enhance the guidance of the search tree and effectively improve the path convergence speed and search efficiency.
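Goal-biased sampling is a small change to the RRT sampling step. The sketch below uses the 30% probability given as an example in the text; the function name, the box-shaped configuration bounds, and the use of Python's random module are illustrative assumptions.

```python
import random

def sample_config(goal, lower, upper, goal_bias=0.3, rng=random):
    """Return the goal with probability `goal_bias`, otherwise a uniform random sample.

    `lower` and `upper` bound each dimension of the configuration space.
    """
    if rng.random() < goal_bias:
        return tuple(goal)
    return tuple(rng.uniform(lo, hi) for lo, hi in zip(lower, upper))
```

A higher bias pulls the tree toward the goal faster in open space but makes it more likely to stall in front of obstacles, so the value is usually tuned per scene.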

[0133] (2) Parent Node Optimization and Path Pruning

[0134] After a new node is generated, a local search region is constructed centered on that node, and its parent node is dynamically updated to be the locally optimal connection point. Simultaneously, a trajectory reconnection algorithm is used to remove redundant nodes, resulting in a shorter and smoother generated path, reducing robotic arm jitter and energy consumption during execution.
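One way to read these two steps is as an RRT*-style parent selection followed by greedy shortcutting of the final path. The sketch below follows that interpretation, which is an assumption since the text does not give the exact procedure. The collision_free callback, the cost list, and the search radius are hypothetical inputs supplied by the caller.

```python
import math

def choose_parent(new_node, nodes, cost, radius, collision_free):
    """Pick the neighbor within `radius` that gives `new_node` the lowest path cost.

    `cost[i]` is the path cost from the start to `nodes[i]`.
    Returns (parent index or None, resulting cost).
    """
    best, best_cost = None, math.inf
    for i, node in enumerate(nodes):
        d = math.dist(node, new_node)
        if d <= radius and cost[i] + d < best_cost and collision_free(node, new_node):
            best, best_cost = i, cost[i] + d
    return best, best_cost

def prune_path(path, collision_free):
    """Greedy shortcutting: jump to the farthest waypoint reachable in a straight line."""
    pruned = [path[0]]
    i = 0
    while i < len(path) - 1:
        j = len(path) - 1
        # Walk back from the end until a collision-free direct segment is found.
        while j > i + 1 and not collision_free(path[i], path[j]):
            j -= 1
        pruned.append(path[j])
        i = j
    return pruned
```

Consecutive waypoints of the input path are assumed to be connectable, so the inner loop always terminates at j = i + 1 in the worst case.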

[0135] Simulation results show that the improved RRT algorithm outperforms traditional methods in terms of path quality, planning time, and success rate, making it particularly suitable for path generation needs requiring rapid response in dynamic environments.

[0136] To ensure the feasibility of the grasping path and the safety of the robotic arm's movement, this invention introduces a two-layer collision detection mechanism based on geometric modeling for use in both single-arm and dual-arm systems:

[0137] Geometric modeling method: The Minimum Cylindrical Envelope (MCE) method is adopted to construct a compact geometric model based on the actual shape of the robotic arm, thereby improving detection accuracy and real-time performance.
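One way to check two cylinder-enveloped links against each other is sketched below. As a simplifying assumption, each link is treated as an axis segment with a radius, and two links are flagged as colliding when the distance between their axes is less than the sum of the radii. This rounds the cylinder ends into hemispheres (a capsule), which is slightly conservative. The names `segment_distance` and `links_collide` and the dimensions are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _clamp01(v):
    return max(0.0, min(1.0, v))

def segment_distance(p1, q1, p2, q2, eps=1e-12):
    """Minimum distance between 3-D segments p1-q1 and p2-q2."""
    d1, d2, r = _sub(q1, p1), _sub(q2, p2), _sub(p1, p2)
    a, e, f = _dot(d1, d1), _dot(d2, d2), _dot(d2, r)
    if a <= eps and e <= eps:          # both segments degenerate to points
        s = t = 0.0
    elif a <= eps:                     # first segment is a point
        s, t = 0.0, _clamp01(f / e)
    else:
        c = _dot(d1, r)
        if e <= eps:                   # second segment is a point
            t, s = 0.0, _clamp01(-c / a)
        else:
            b = _dot(d1, d2)
            denom = a * e - b * b      # zero when the segments are parallel
            s = _clamp01((b * f - c * e) / denom) if denom > eps else 0.0
            t = (b * s + f) / e
            if t < 0.0:
                t, s = 0.0, _clamp01(-c / a)
            elif t > 1.0:
                t, s = 1.0, _clamp01((b - c) / a)
    c1 = tuple(p + s * d for p, d in zip(p1, d1))
    c2 = tuple(p + t * d for p, d in zip(p2, d2))
    return math.dist(c1, c2)

def links_collide(link_a, link_b, margin=0.0):
    """Each link is (start_point, end_point, radius)."""
    (pa, qa, ra), (pb, qb, rb) = link_a, link_b
    return segment_distance(pa, qa, pb, qb) < ra + rb + margin

# Hypothetical links with 5 cm radius.
upper = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.05)
parallel = ((0.0, 0.0, 0.2), (1.0, 0.0, 0.2), 0.05)
crossing = ((0.5, -1.0, 0.05), (0.5, 1.0, 0.05), 0.05)
```

Because each link pair needs only one closed-form distance evaluation, this kind of check is cheap enough to run at every planning step.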

[0138] Layered detection mechanism: For dual-arm collaborative tasks, this invention designs a master-slave arm decoupled detection strategy: first, the master arm path is planned and its poses at discrete time steps are recorded; the master arm motion is then treated as a dynamic obstacle to be avoided during slave arm path planning. Planning the two arms in sequence in this way, rather than jointly, significantly reduces the computational complexity.
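The decoupled check can be illustrated with a time-indexed comparison of the two trajectories. In this sketch each arm pose at a time step is approximated by a list of sample points and a clearance distance; this is a simplification for illustration and not the cylinder-envelope model itself. The function `slave_path_is_safe` and all trajectories are hypothetical.

```python
import math

def slave_path_is_safe(master_traj, slave_traj, clearance):
    """Check a candidate slave trajectory against the recorded master trajectory.

    Each trajectory is a list over time steps; each entry is a list of 3-D points
    sampled on the arm at that step. Returns (True, None) or (False, first_bad_step).
    """
    for k, slave_pts in enumerate(slave_traj):
        # Once the master has finished, it is assumed to hold its last pose.
        master_pts = master_traj[min(k, len(master_traj) - 1)]
        for a in master_pts:
            for b in slave_pts:
                if math.dist(a, b) < clearance:
                    return False, k
    return True, None

# Master end point moves along x from 0 to 1 m over steps 0..10.
master_traj = [[(0.1 * k, 0.0, 0.5)] for k in range(11)]

# Candidate A crosses the master's line at the same time the master is there.
slave_a = [[(0.5, 0.5 - 0.1 * k, 0.5)] for k in range(11)]

# Candidate B follows the same geometric path but waits six steps before moving.
slave_b = [[(0.5, 0.5 - 0.1 * max(0, k - 6), 0.5)] for k in range(17)]

result_a = slave_path_is_safe(master_traj, slave_a, clearance=0.12)
result_b = slave_path_is_safe(master_traj, slave_b, clearance=0.12)
```

Candidate A is rejected at the step where both arms occupy the same location, while candidate B is accepted because the same location is crossed at a different time.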

[0139] The single-arm grasping control process includes: pre-grasp pose generation: a set of safe buffer poses is placed in front of the target position to ensure that the grasping direction is aligned with the principal axis of the object; synchronous control mechanism: the gripper begins to close as the end effector reaches the preset position, and the system adaptively adjusts the closing speed and force according to gripper status feedback; grasp exit path planning: after a successful grip is confirmed, a shortest obstacle-avoiding path is replanned based on the current environment to place the object in the designated area.
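Pre-grasp pose generation can be sketched as a set of waypoints backed off from the target along the approach direction. The sketch below covers positions only and assumes that the approach direction has already been chosen in line with the object's principal axis. The function `pregrasp_waypoints` and the offsets of 10 cm and 5 cm are hypothetical.

```python
import math

def pregrasp_waypoints(target, approach_dir, offsets=(0.10, 0.05)):
    """Return buffer positions backed off from `target` against the approach direction.

    target:       (x, y, z) grasp position in metres
    approach_dir: direction in which the gripper moves toward the target
    offsets:      stand-off distances, ordered from farthest to nearest
    """
    norm = math.sqrt(sum(c * c for c in approach_dir))
    if norm == 0.0:
        raise ValueError("approach_dir must be non-zero")
    unit = tuple(c / norm for c in approach_dir)
    return [tuple(t - d * u for t, u in zip(target, unit)) for d in offsets]

# Hypothetical top-down grasp: the gripper approaches along -z.
waypoints = pregrasp_waypoints((0.5, 0.0, 0.2), (0.0, 0.0, -2.0))
```

The end effector visits the waypoints in order and then moves to the target itself, so the final approach is a short straight-line motion along the grasping direction.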

[0140] Dual-arm grasping tasks are divided into two categories based on coordination requirements: Loosely Coordinated mode: Each arm performs an independent task, such as grasping multiple objects simultaneously from the left and right, planning its own path, and detecting collisions with each other in real time; Tightly Coordinated mode: Both arms work together on a single object to form a closed-loop system.

[0141] This invention proposes a master-slave arm temporal collaborative control strategy: fix the slave arm pose and treat it as an obstacle to plan the master arm path; solve the slave arm synchronous motion based on the master arm trajectory, and complete the coordination through the relative pose constraints between the end effectors of both arms and the object; perform synchronous time interpolation to ensure the dynamic consistency of both arms throughout the task.

[0142] This strategy is applicable to complex scenarios such as collaborative handling and folding assembly, improving the stability and task accuracy of the dual-arm system.
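The relative pose constraint and synchronous interpolation can be illustrated in simplified form. The sketch below works with planar poses (x, y, theta) rather than full six-degree-of-freedom poses, samples the master end-effector trajectory at common time steps, and derives the slave end-effector pose at each step from a fixed relative pose. The inverse kinematics that would convert these poses into joint motions is omitted. All names and values are hypothetical.

```python
import math

def compose(a, b):
    """Compose planar poses (x, y, theta): pose b expressed in the frame of pose a."""
    ax, ay, ath = a
    bx, by, bth = b
    return (ax + math.cos(ath) * bx - math.sin(ath) * by,
            ay + math.sin(ath) * bx + math.cos(ath) * by,
            ath + bth)

def interpolate(p0, p1, n):
    """Return n + 1 poses at evenly spaced times between p0 and p1 (linear in each component)."""
    return [tuple(u + (v - u) * k / n for u, v in zip(p0, p1)) for k in range(n + 1)]

# Master end effector translates 1 m along x while turning 90 degrees.
master_traj = interpolate((0.0, 0.0, 0.0), (1.0, 0.0, math.pi / 2), 20)

# Slave end effector is held 0.4 m to the master's right while both grip the object.
slave_in_master = (0.0, -0.4, 0.0)
slave_traj = [compose(m, slave_in_master) for m in master_traj]

# Distance between the two end effectors at every common time step.
gaps = [math.hypot(m[0] - s[0], m[1] - s[1]) for m, s in zip(master_traj, slave_traj)]
```

Because both trajectories share the same time samples, the distance between the two end effectors stays at 0.4 m throughout the motion, which is what keeps the jointly held object from being stretched or compressed.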

[0143] (5) Experimental verification

[0144] A robotic grasping experimental platform was built, with a seven-DOF industrial robotic arm as the main body, a custom gripper installed at the end effector, and an integrated RGB-D vision system. An eye-in-hand structure was used for near-field perception. The computing platform ran a virtual machine under Windows 11, with Ubuntu 20.04 and ROS Noetic deployed inside it. Simulations were performed on the ROS platform and demonstrated real-time vision processing and motion planning capabilities. As shown in Figure 4, the entire system is built on the ROS framework, integrating the perception, recognition, planning, and control modules and using MoveIt for trajectory generation and execution, which supports efficient and stable operation of the system. To verify the performance of the proposed joint-layer perception-based grasping system, single-arm grasping experiments and dual-arm collaborative grasping experiments were designed. The single-arm grasping task, based on the Sawyer collaborative robotic arm and the eye-in-hand multimodal vision system, mainly evaluates the combined performance of target recognition, pose estimation, and trajectory planning; the dual-arm collaborative task simulates a complex grasping environment and examines the system's stability and robustness in dynamic collaboration.

[0145] Example 1

[0146] Single-arm grasping experiment: A standard-shaped express delivery box was selected as the target, and positioning and grasping tests were conducted under normal lighting conditions. The positioning and grasping process is shown in Figure 5. A total of 50 tests were conducted, and the average depth image filtering time (s), average object recognition and pose detection time (s), and positioning accuracy (mm) were recorded; the results are given in Table 1.

[0147] Example 2

[0148] Single-arm grasping experiment: A low-reflectivity object was selected as the target, and positioning and grasping tests were conducted against a complex material background. The positioning and grasping process is shown in Figure 6. A total of 50 tests were conducted, and the average depth image filtering time (s), average object recognition and pose detection time (s), and positioning accuracy (mm) were recorded; the results are given in Table 1.

[0149] Table 1. Test results for Examples 1 and 2

[0150]

[0151] As can be seen from Table 1, the system can stably achieve a positioning accuracy within ±3mm under different object conditions, meeting the requirements of grasping operations.

[0152] Example 3

[0153] Dual-arm collaborative grasping experiment: For intelligent book grasping applications, a human-machine collaborative platform was built. A seven-DOF Sawyer robotic arm, together with an auxiliary vacuum suction cup device, completed the book grasping and sorting tasks. Due to limitations in experimental equipment, a single-arm, human-machine collaborative scheme was adopted: the experimental platform consisted of a seven-DOF Sawyer collaborative robotic arm and a human operator, simulating a dual-arm collaborative book grasping task. The RGB image of the book is shown in Figure 7, the ToF depth image in Figure 8, the 3D point cloud in Figure 9, the ROI point cloud in Figure 10, and the grasping process in Figure 11. The experiment simulated a library environment to examine the system's gripping stability and grasping success rate on open-structure targets (such as books), verifying the effectiveness and application potential of the robot joint layer perception system fusing ToF and monocular vision in practical scenarios.

[0154] Although embodiments of the present invention have been shown and described above, it is understood that these embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions, and alterations to the above embodiments within the scope of the present invention without departing from its principles and spirit. The scope of protection of the present invention is defined by the claims and their equivalents.

Claims

1. A robot joint-level perception system integrating ToF and monocular vision, characterized in that it comprises:
Perception module: integrating a ToF depth camera and a monocular RGB camera, installed at the end of the robot in an eye-in-hand structure; the perception module uses Zhang Zhengyou's chessboard calibration method to calibrate the intrinsic parameters of the monocular RGB camera;
Depth map optimization processing module: adopting a two-stage optimization using joint bilateral filtering and Kalman filtering; using the RGB image as the guide image, it performs edge-preserving filtering on the ToF depth map to suppress noise while preserving object contour information;
Target recognition and pose estimation module: target recognition uses the SURF algorithm to extract local feature points from RGB images, uses FLANN to accelerate matching and establish correspondences with the target template image, and then uses the RANSAC algorithm to filter out erroneous matches, compute the image homography matrix, and obtain the target region of interest (ROI); pose estimation uses KD-Tree-accelerated ICP point cloud registration to achieve six-degree-of-freedom pose estimation;
Path planning and grasping control module: based on the ROS platform, path planning and grasping control are achieved through an improved RRT path planning algorithm, a layered collision detection mechanism, a single-arm grasping control process, and a dual-arm cooperative control strategy;
The improved RRT path planning algorithm includes: (1) Goal-biased sampling: when expanding the tree structure, the system directly selects the target pose as the sampling point with a certain probability to enhance the guidance of the search tree and improve the path convergence speed and search efficiency; (2) Dynamic parent node optimization and path pruning: after a new node is generated, a local search region is constructed centered on that node, and its parent node is dynamically updated to be the locally optimal connection point; at the same time, a trajectory reconnection algorithm is used to remove redundant nodes, so that the generated path is shorter and smoother and robotic arm jitter or energy consumption during execution is reduced;
The layered collision detection mechanism includes: (1) constructing the geometric model of the robotic arm based on the minimum cylindrical envelope method; (2) decoupled detection of the master and slave arms in dual-arm tasks: first, the path of the master arm is planned and its temporal poses are recorded, and then the motion of the master arm is treated as a dynamic obstacle when planning the path of the slave arm.

2. The system according to claim 1, characterized in that the single-arm grasping control process includes: (1) Pre-grasp pose generation: a set of safe buffer poses is placed in front of the target position to ensure that the grasping direction is aligned with the principal axis of the object; (2) Synchronous control mechanism: the gripper begins to close as the end effector reaches the preset position, and the system adaptively adjusts the closing speed and force according to gripper status feedback; (3) Grasp exit path planning: after a successful grip is confirmed, the shortest obstacle-avoiding path is replanned based on the current environment to place the object in the designated area.

3. The system according to claim 1, characterized in that the dual-arm cooperative control strategy includes: (1) Fix the pose of the slave arm, treat it as an obstacle, and plan the path of the master arm; (2) Solve the synchronous motion of the slave arm inversely from the master arm trajectory, and achieve coordination through the relative pose constraints between the end effectors of the two arms and the object; (3) Perform time-synchronized interpolation to ensure the dynamic consistency of the two arms throughout the entire task.