Robotic grasping control method and apparatus

By acquiring 2D and depth images of the environment to generate 3D point clouds, and determining and evaluating candidate grasping poses, the problem of low intelligence level of robot grasping in complex environments is solved, and accurate and efficient grasping results are achieved.

CN122274992A · Pending · Publication Date: 2026-06-26 · 伽利略(天津)技术有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
伽利略(天津)技术有限公司
Filing Date
2026-05-09
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, robots have a low level of intelligence in autonomous grasping in complex environments, and the selection of grasping poses relies on simple geometric rules, resulting in insufficient efficiency and accuracy.

Method used

By acquiring 2D and depth images of the environment, a 3D point cloud of the target object is generated, candidate grasping poses are determined, and multimodal evaluation is performed to select the optimal grasping pose, which is then combined with the robotic arm to perform the grasping action.

Benefits of technology

This has improved the accuracy and intelligence of the robot's autonomous grasping in complex environments, thereby enhancing the accuracy and efficiency of grasping.

Abstract

This application relates to a robot grasping control method and apparatus. The method includes: acquiring real-time environmental images, including a two-dimensional environmental image and an environmental depth image; determining a target object region in the two-dimensional environmental image and generating a target three-dimensional point cloud corresponding to the target object region based on the environmental depth image; determining multiple first candidate points in the target three-dimensional point cloud and predicting a candidate grasping pose corresponding to each first candidate point; performing multimodal evaluation on each candidate grasping pose to obtain an evaluation result corresponding to each candidate grasping pose, and determining the optimal grasping pose among the candidate grasping poses based on the evaluation results; and controlling the robot to execute the corresponding grasping action based on the optimal grasping pose. This improves the intelligence level of autonomous grasping in complex environments.

Description

Technical Field

[0001] This application relates to the field of robot control technology, and in particular to robot grasping control methods and devices.

Background Technology

[0002] With the development of robot manipulation technology, the autonomous operation capability of robots in open environments has gradually become a research hotspot, and it is widely used in various fields such as industrial manufacturing, smart homes, and warehousing and logistics. Among them, robot grasping, as the core link of autonomous operation, directly determines the robot's work efficiency, operational accuracy, and environmental adaptability.

[0003] However, in existing technologies, the selection of grasping poses usually relies on simple geometric rules, resulting in a low level of intelligence in autonomous grasping of robots in complex environments.

[0004] Related technologies currently offer no effective solution to the low level of intelligence in autonomous robot grasping in complex environments.

Summary of the Invention

[0005] This embodiment provides a robot grasping control method and apparatus, which aims to improve the level of intelligence in robot autonomous grasping in complex environments.

[0006] In a first aspect, this embodiment provides a robot grasping control method, the method comprising:

[0007] Acquire real-time environmental images; the environmental images include two-dimensional environmental images and environmental depth images;

[0008] The target object region in the two-dimensional environmental image is determined, and a target three-dimensional point cloud corresponding to the target object region is generated based on the environmental depth image.

[0009] Determine multiple first candidate points in the target 3D point cloud, and predict the candidate grasping pose corresponding to each first candidate point;

[0010] Multimodal evaluation is performed on each candidate grasping pose to obtain the evaluation result corresponding to each candidate grasping pose, and the optimal grasping pose among each candidate grasping pose is determined based on the evaluation result corresponding to each candidate grasping pose.

[0011] Based on the optimal grasping pose, the robot is controlled to perform the corresponding grasping action.

[0012] In a second aspect, this embodiment provides a robot grasping control device, the device comprising:

[0013] The acquisition module is used to acquire real-time environmental images; the environmental images include two-dimensional environmental images and environmental depth images.

[0014] The generation module is used to determine the target object region in the two-dimensional environmental image and generate a target three-dimensional point cloud corresponding to the target object region based on the environmental depth image.

[0015] The prediction module is used to determine multiple first candidate points in the target 3D point cloud and predict the candidate grasping pose corresponding to each first candidate point.

[0016] The evaluation module is used to perform multimodal evaluation on each of the candidate grasping poses, obtain the evaluation result corresponding to each candidate grasping pose, and determine the optimal grasping pose among the candidate grasping poses based on the evaluation result corresponding to each candidate grasping pose.

[0017] The control module is used to control the robot to perform the corresponding grasping action based on the optimal grasping pose.

[0018] Compared with related technologies, the robot grasping control method and apparatus provided in this embodiment acquire real-time environmental images, including two-dimensional environmental images and environmental depth images; determine the target object region in the two-dimensional environmental image, and generate a target three-dimensional point cloud corresponding to the target object region based on the environmental depth image; determine multiple first candidate points in the target three-dimensional point cloud, and predict the candidate grasping pose corresponding to each first candidate point; perform multimodal evaluation on each candidate grasping pose to obtain the evaluation result corresponding to each candidate grasping pose, and determine the optimal grasping pose among the candidate grasping poses based on the evaluation results corresponding to each candidate grasping pose; and control the robot to execute the corresponding grasping action based on the optimal grasping pose. In this way, by fusing two-dimensional and depth images to reconstruct the three-dimensional point cloud of the target object, the spatial position and shape information of the target can be accurately obtained. Based on this, candidate grasping points are selected and corresponding grasping poses are generated. Then, the optimal grasping pose is selected by combining multimodal evaluation, thereby achieving accurate evaluation of the grasping pose and improving the intelligence level of robot autonomous grasping in complex environments.

[0019] Details of one or more embodiments of this application are set forth in the following drawings and description to make other features, objects and advantages of this application more readily apparent.

Description of the Drawings

[0020] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:

[0021] Figure 1 is a flowchart of an embodiment of the robot grasping control method provided in this application;

[0022] Figure 2 is a flowchart of Embodiment 2 of the robot grasping control method provided in this application;

[0023] Figure 3 is a flowchart of Embodiment 3 of the robot grasping control method provided in this application;

[0024] Figure 4 is a flowchart of Embodiment 4 of the robot grasping control method provided in this application;

[0025] Figure 5 is a flowchart of Embodiment 5 of the robot grasping control method provided in this application;

[0026] Figure 6 is a schematic diagram of the candidate grasping pose projection results provided in this application;

[0027] Figure 7 is a flowchart of Embodiment 6 of the robot grasping control method provided in this application;

[0028] Figure 8 is a flowchart of Embodiment 7 of the robot grasping control method provided in this application;

[0029] Figure 9 is a structural block diagram of the robot grasping control device provided in this application.

Detailed Description

[0030] To better understand the purpose, technical solution, and advantages of this application, the application is described and illustrated below in conjunction with the accompanying drawings and embodiments.

[0031] Unless otherwise defined, the technical or scientific terms used in this application shall have the general meaning understood by one of ordinary skill in the art to which this application pertains. Words such as “a,” “an,” “the,” and “these” used in this application do not indicate quantitative limitation and may be singular or plural. The terms “comprising,” “including,” “having,” and any variations thereof used in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or modules (units) is not limited to the listed steps or modules (units) but may include steps or modules (units) not listed, or may include other steps or modules (units) inherent to these processes, methods, products, or devices. Words such as “connected,” “linked,” and “coupled” used in this application are not limited to physical or mechanical connections but may include electrical connections, whether direct or indirect. “Multiple” used in this application refers to two or more. “And/or” describes the relationship between related objects, indicating that three relationships may exist; for example, “A and/or B” can represent: A alone, both A and B, or B alone. In general, the character “/” indicates that the objects before and after it are in an “or” relationship. The terms “first,” “second,” “third,” etc., used in this application merely distinguish similar objects and do not represent a specific order of objects.

[0032] This application provides a robot grasping control method and apparatus, which aims to improve the level of intelligence in robot autonomous grasping in complex environments.

[0033] The following are several specific embodiments to illustrate the technical solutions of this application in detail. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0034] This embodiment provides a robot grasping control method. Figure 1 is a flowchart of the robot grasping control method in this embodiment. As shown in Figure 1, the method includes the following steps:

[0035] Step S101: Acquire real-time environmental images; the environmental images include two-dimensional environmental images and environmental depth images.

[0036] Specifically, after controlling the robot's robotic arm to move to the pre-grasping pose, the camera mounted on the robot's arm acquires environmental images in real time. These environmental images are scene perception data of the robot's current working environment, specifically including two-dimensional environmental images and environmental depth images. For example, RGB-D images are acquired in real time using a depth camera.

[0037] Step S102: Determine the target object region in the two-dimensional environmental image, and generate the target three-dimensional point cloud corresponding to the target object region based on the environmental depth image.

[0038] For example, in one embodiment, based on a preset grasping target, target detection and recognition are performed on a two-dimensional environmental image to obtain the target object region in the two-dimensional environmental image. The preset grasping target can be a cup, a spoon, or various workpieces to be operated, and can be adaptively set according to the actual operation scenario, which will not be elaborated here.

[0039] Furthermore, depth information corresponding to each pixel within the target object region is extracted from the environmental depth image. Based on the extracted depth information and the camera intrinsic parameter matrix, a pinhole model is used to back-project each pixel within the target object region into 3D space to generate a target 3D point cloud corresponding to the target object region. The camera intrinsic parameter matrix refers to the intrinsic parameter matrix of the camera used to acquire the environmental images.

[0040] Optionally, in one possible implementation, generating a target 3D point cloud corresponding to the target object region based on the environmental depth image may include: converting the bounding box information of the target object region into a corresponding structured cue, and inputting the structured cue and the environmental 2D image into a segmentation model to obtain a probability map; the value corresponding to each pixel in the probability map is the confidence score that the pixel belongs to the target object; using a preset confidence threshold as the binarization boundary, binarizing the probability map based on the confidence score corresponding to each pixel to generate a target 2D mask; determining the depth information corresponding to each pixel in the target 2D mask based on the environmental depth image; and backprojecting each pixel in the target 2D mask into a corresponding 3D point (x, y, z) based on the depth information corresponding to each pixel and the camera intrinsic parameter matrix, and combining the various 3D points to obtain the target 3D point cloud.

[0041] The back projection calculation formula is as follows:

[0042] $$x = \frac{(u - c_x)\, D(u, v)}{f_x}$$

[0043] $$y = \frac{(v - c_y)\, D(u, v)}{f_y}$$

[0044] $$z = D(u, v)$$

[0045] In the formula, (u, v) represents the coordinates of each pixel in the target 2D mask; D(u, v) represents the depth information corresponding to the pixel with coordinates (u, v); and $c_x$, $c_y$, $f_x$, and $f_y$ are the intrinsic parameters in the camera intrinsic parameter matrix.
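As a minimal sketch of the mask binarization and pinhole back-projection described above, the following NumPy snippet may help. The function name `backproject_mask`, the 0.5 confidence threshold, and the toy intrinsics are illustrative assumptions, not values taken from the application.

```python
import numpy as np

def backproject_mask(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels into camera-frame 3D points.

    depth: (H, W) array of depth values D(u, v), e.g. in metres.
    mask:  (H, W) boolean array, True where the pixel belongs to the target.
    Returns an (N, 3) array of (x, y, z) points.
    """
    # Rows index v (image y) and columns index u (image x).
    v, u = np.nonzero(mask & (depth > 0))  # skip pixels without a depth reading
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Hypothetical probability map from a segmentation model, binarized at 0.5.
prob = np.array([[0.1, 0.9], [0.8, 0.2]])
mask = prob > 0.5
depth = np.full((2, 2), 2.0)
points = backproject_mask(depth, mask, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Filtering out zero-depth pixels is a common precaution with depth cameras and is an addition of this sketch; the application does not state how missing depth readings are handled.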

[0046] After the target 3D point cloud is generated, it can be further normalized to improve the stability of subsequent processing. This normalization includes, but is not limited to, removing outliers, cropping local target regions, standardizing the number of input points, and normalizing the position and scale of the point cloud. Furthermore, to facilitate spatial feature encoding by the network, the target 3D point cloud can be converted into a voxel representation. Finally, the normalized target 3D point cloud can be represented as $P \in \mathbb{R}^{N_0 \times C_0}$, where $N_0$ is the number of 3D points in the normalized point cloud and $C_0$ is the attribute dimension of each 3D point.
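One possible reading of the normalization steps listed above (outlier removal, a fixed point count, and position and scale normalization) is sketched below. The statistical outlier rule, the default of 1024 points, and the function name `normalize_cloud` are assumptions made for illustration; cropping and voxelization are left out.

```python
import numpy as np

def normalize_cloud(points, n_points=1024, k_sigma=2.0, seed=0):
    """Remove outliers, resample to a fixed size, then center and scale."""
    rng = np.random.default_rng(seed)
    # 1. Outlier removal: drop points far from the centroid (simple heuristic).
    dist = np.linalg.norm(points - points.mean(axis=0), axis=1)
    kept = points[dist <= dist.mean() + k_sigma * dist.std()]
    # 2. Fixed point count: sample with replacement only if too few points remain.
    idx = rng.choice(len(kept), size=n_points, replace=len(kept) < n_points)
    sampled = kept[idx]
    # 3. Position and scale normalization: zero centroid, unit bounding radius.
    center = sampled.mean(axis=0)
    centered = sampled - center
    scale = np.linalg.norm(centered, axis=1).max()
    scale = scale if scale > 0 else 1.0
    return centered / scale, center, scale

rng = np.random.default_rng(1)
cloud = np.vstack([rng.normal(size=(500, 3)), [[100.0, 100.0, 100.0]]])
normalized, center, scale = normalize_cloud(cloud, n_points=256)
```

Returning `center` and `scale` alongside the normalized points lets a predicted grasp be mapped back into camera coordinates later.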

[0047] Step S103: Determine multiple first candidate points in the target 3D point cloud and predict the candidate grasping pose corresponding to each first candidate point.

[0048] Optionally, feature extraction is performed on the target 3D point cloud to obtain the high-dimensional features corresponding to each 3D point in the target 3D point cloud. Each 3D point in the target 3D point cloud is determined as a second candidate point. For each second candidate point, based on the high-dimensional features corresponding to the second candidate point, a grasping quality prediction is performed to obtain the grasping quality prediction score corresponding to the second candidate point. The second candidate points are sorted from high to low according to the grasping quality prediction score, and a predetermined number of the top-ranked second candidate points are determined as first candidate points. Subsequently, grasping attribute prediction is performed on the high-dimensional features corresponding to each first candidate point to obtain the candidate grasping pose corresponding to each first candidate point.

[0049] Among them, the candidate grasping pose includes the coordinates of the first candidate point (i.e., the coordinates of the grasping center), the target approach direction corresponding to the first candidate point (i.e., the spatial approach vector of the robotic arm gripper moving toward the target object), the in-plane rotation angle (i.e., the angle parameter of the gripper rotating around the target approach direction), the grasping depth (i.e., the distance the gripper extends into the target object's gripping area along the target approach direction), and the gripper width (i.e., the spacing parameter of the gripper opening and closing).

[0050] It should be noted that the above feature extraction can be implemented using algorithms such as PointNet and DGCNN, and no specific limitation is made here.

[0051] In addition, optionally, after obtaining the candidate grasping poses corresponding to each first candidate point, further post-processing optimization can be performed on each candidate grasping pose. For example, spatial clustering can be performed on each candidate grasping pose, and multiple candidate grasping poses with high spatial overlap can be determined based on the clustering results. Then, candidate grasping poses with lower grasping quality prediction scores among the multiple candidate grasping poses with high spatial overlap can be removed. As another example, using the target's 3D point cloud, a micro-collision simulation can be established for each candidate grasping pose, and candidate grasping poses with physical occlusion on the grasping path can be removed based on the collision simulation results.
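The overlap-removal step can be approximated in several ways. The sketch below uses greedy distance-based suppression on grasp centers rather than an explicit clustering algorithm, which is a simplification chosen here for brevity; the 2 cm radius and the function name are hypothetical.

```python
import numpy as np

def suppress_overlapping(centers, scores, min_dist=0.02):
    """Keep the best-scoring pose in each group of spatially close poses.

    centers: (N, 3) grasp centers in metres; scores: (N,) quality scores.
    Returns the indices of the kept poses, highest score first.
    """
    order = np.argsort(-scores)  # visit poses from best to worst
    kept = []
    for i in order:
        # Keep pose i only if no better pose already kept lies within min_dist.
        if all(np.linalg.norm(centers[i] - centers[j]) >= min_dist for j in kept):
            kept.append(int(i))
    return kept

centers = np.array([[0.0, 0.0, 0.5], [0.005, 0.0, 0.5], [0.1, 0.0, 0.5]])
scores = np.array([0.7, 0.9, 0.6])
kept = suppress_overlapping(centers, scores)
```

Here poses 0 and 1 are 5 mm apart, so only the higher-scoring pose 1 survives, together with the distant pose 2.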

[0052] Step S104: Perform multimodal evaluation on each candidate grasping pose to obtain the evaluation result corresponding to each candidate grasping pose, and determine the optimal grasping pose among the candidate grasping poses based on the evaluation results corresponding to each candidate grasping pose.

[0053] In practical implementation, for example, the 6D pose parameters corresponding to the candidate grasping pose are determined. Based on the 6D pose parameters and the preset joint limit thresholds of the robot's robotic arm, an evaluation sub-item is constructed, and another evaluation sub-item is constructed based on the user's grasping command used to indicate the grasping target. The two evaluation sub-items are weighted to construct a multi-dimensional joint evaluation function, and the candidate grasping pose is evaluated in a multi-modal manner according to the multi-dimensional joint evaluation function.

[0054] It should be noted that the presentation format of the evaluation results can be flexibly chosen. For example, if the evaluation results are presented in the form of an evaluation score, a higher score indicates better feasibility and fit of the candidate grasping pose, and the candidate grasping pose with the highest evaluation score is selected as the optimal grasping pose. As another example, if the evaluation results are presented in the form of an evaluation level, then the candidate grasping pose with the highest evaluation level is selected as the optimal grasping pose.
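A weighted joint evaluation of this kind might look like the sketch below. It assumes that each candidate pose has already been converted to joint angles by an inverse-kinematics solver and that the match with the user's grasping command is available as a number in [0, 1]; both inputs, the equal weights, and the function names are hypothetical.

```python
import numpy as np

def joint_limit_score(joint_angles, lower, upper):
    """Score in [0, 1]: 1 at mid-range, 0 at or beyond a joint limit."""
    mid = (lower + upper) / 2.0
    half_range = (upper - lower) / 2.0
    margin = 1.0 - np.abs(joint_angles - mid) / half_range
    return float(np.clip(margin, 0.0, 1.0).min())  # the worst joint dominates

def joint_evaluation(kinematic_score, instruction_score, w_kin=0.5, w_cmd=0.5):
    """Weighted combination of the two evaluation sub-items."""
    return w_kin * kinematic_score + w_cmd * instruction_score

lower = np.array([-3.0, -2.0, -3.0])
upper = np.array([3.0, 2.0, 3.0])
# Hypothetical inverse-kinematics solutions and instruction-match scores.
candidates = [
    {"joints": np.array([0.0, 0.0, 0.0]), "cmd": 0.3},
    {"joints": np.array([1.5, 1.0, 0.0]), "cmd": 0.9},
    {"joints": np.array([2.9, 0.0, 0.0]), "cmd": 1.0},
]
scores = [
    joint_evaluation(joint_limit_score(c["joints"], lower, upper), c["cmd"])
    for c in candidates
]
best_index = int(np.argmax(scores))
```

The third candidate matches the command best but sits almost at a joint limit, so the weighted score favors the second candidate.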

[0055] Step S105: Based on the optimal grasping pose, control the robot to execute the corresponding grasping action.

[0056] In this step, the 6D pose parameters (i.e., the target pose in the camera coordinate system) corresponding to the optimal grasping pose are pre-determined, and the 6D pose parameters are converted into the actual physical movements of the robotic arm. The specific conversion formula is as follows:

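[0057] $$T_{\text{target}}^{\text{base}} = T_{\text{ee}}^{\text{base}} \, T_{\text{cam}}^{\text{ee}} \, T_{\text{grasp}}^{\text{cam}}$$

[0058] In the formula, $T_{\text{target}}^{\text{base}}$ is the target pose in the robot's base coordinate system; $T_{\text{cam}}^{\text{ee}}$ is the rigid-body transformation from the camera to the end effector, obtained from hand-eye calibration; $T_{\text{ee}}^{\text{base}}$ is the pose of the robotic arm end effector in the base coordinate system; and $T_{\text{grasp}}^{\text{cam}}$ is the grasping pose expressed in the camera coordinate system. All of the above matrices are 4×4 homogeneous transformation matrices, each containing a rotation matrix R and a translation vector t.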
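The camera-to-base conversion amounts to chaining homogeneous transforms. The sketch below shows that chain with NumPy; the variable names and the numeric poses are made up for illustration.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def grasp_in_base(T_base_ee, T_ee_cam, T_cam_grasp):
    """Chain base<-end effector, end effector<-camera, and camera<-grasp."""
    return T_base_ee @ T_ee_cam @ T_cam_grasp

# Illustrative values: end effector rotated 90 degrees about z in the base frame.
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T_base_ee = make_transform(Rz90, [0.5, 0.0, 0.4])          # from the arm's forward kinematics
T_ee_cam = make_transform(np.eye(3), [0.0, 0.0, 0.1])      # from hand-eye calibration
T_cam_grasp = make_transform(np.eye(3), [0.2, 0.0, 0.3])   # predicted grasp pose
T_base_grasp = grasp_in_base(T_base_ee, T_ee_cam, T_cam_grasp)
```

The order of multiplication matters: the rightmost matrix is applied first, so the grasp pose is carried from the camera frame to the end-effector frame and then to the base frame.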

[0059] After the coordinate transformation is complete, the robotic arm executes the grasping action according to preset phased action logic. First, it moves to a pre-grasping position at a preset offset above the target pose (for example, a preset offset $d_{pre}$ = 0.1 m). This step adjusts the approach direction of the end effector to avoid collisions or interference with the target object or the surrounding environment during a direct approach. Once the pre-grasping position has been reached, the robotic arm descends smoothly to the target grasping point along the preset approach direction and closes the gripper to complete the grasping action. After the grasp is confirmed, the robotic arm slowly lifts the grasped object, removes it from the initial grasping area, and finally returns to its initial zero position, completing the entire grasping execution process.
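The phased motion can be sketched as a short list of waypoints. The snippet below assumes that the gripper approaches along the +z axis of the grasp frame and that lifting simply retreats to the pre-grasping pose; both conventions, and the command names, are assumptions rather than details from the application.

```python
import numpy as np

def pregrasp_pose(T_base_grasp, d_pre=0.1):
    """Back off d_pre metres from the grasp pose along its approach axis."""
    T_pre = T_base_grasp.copy()
    approach = T_base_grasp[:3, 2]  # grasp-frame z axis in base coordinates
    T_pre[:3, 3] = T_base_grasp[:3, 3] - d_pre * approach
    return T_pre

def grasp_sequence(T_base_grasp, d_pre=0.1):
    """Return the phased actions as (command, pose) pairs."""
    T_pre = pregrasp_pose(T_base_grasp, d_pre)
    return [
        ("move_to", T_pre),          # reach the pre-grasping position
        ("move_to", T_base_grasp),   # descend along the approach direction
        ("close_gripper", None),     # grasp
        ("move_to", T_pre),          # lift the object clear of the grasping area
        ("move_home", None),         # return to the initial zero position
    ]

# Top-down grasp: the grasp-frame z axis points straight down in the base frame.
T_grasp = np.eye(4)
T_grasp[:3, :3] = np.diag([1.0, -1.0, -1.0])
T_grasp[:3, 3] = [0.5, 0.2, 0.05]
steps = grasp_sequence(T_grasp, d_pre=0.1)
```

For this top-down example the pre-grasping waypoint ends up 0.1 m directly above the grasp point.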

[0060] The robot grasping control method provided in this embodiment acquires real-time environmental images, including two-dimensional environmental images and environmental depth images; determines the target object region in the two-dimensional environmental image, and generates a target three-dimensional point cloud corresponding to the target object region based on the environmental depth image; determines multiple first candidate points in the target three-dimensional point cloud, and predicts the candidate grasping pose corresponding to each first candidate point; performs multimodal evaluation on each candidate grasping pose to obtain the evaluation result corresponding to each candidate grasping pose, and determines the optimal grasping pose among the candidate grasping poses based on the evaluation results corresponding to each candidate grasping pose; and controls the robot to execute the corresponding grasping action based on the optimal grasping pose. In this way, by fusing two-dimensional images and depth images to reconstruct the three-dimensional point cloud of the target object, the spatial position and shape information of the target can be accurately obtained. On this basis, candidate grasping points are selected and corresponding grasping poses are generated. Then, the optimal grasping pose is selected by combining multimodal evaluation, thereby achieving accurate evaluation of the grasping pose and improving the intelligence level of robot autonomous grasping in complex environments.

[0061] Figure 2 is a flowchart of Embodiment 2 of the robot grasping control method. Referring to Figure 2, in the method provided in this embodiment, which builds on the above embodiments, determining multiple first candidate points in the target 3D point cloud may include the following steps:

[0062] Step S201: Extract features from the target 3D point cloud to obtain the high-dimensional features corresponding to each 3D point in the target 3D point cloud.

[0063] For example, in one embodiment, the target 3D point cloud is input into a feature extraction network. The specific processing of the feature extraction network includes: performing sparse convolution on each 3D point in the target 3D point cloud and the neighboring points of each 3D point to obtain the local geometric features corresponding to each 3D point; and using an encoder-decoder structure to perform multi-scale aggregation on the local geometric features corresponding to each 3D point to obtain multi-scale context features; and then performing point-by-point feature mapping on the multi-scale context features to obtain a high-dimensional feature tensor aligned with the spatial resolution of the target 3D point cloud; the high-dimensional feature tensor includes the high-dimensional features corresponding to each 3D point.

[0064] In this way, the feature representation of a single 3D point can simultaneously integrate local neighborhood geometric details and large-scale structural context information, enabling more accurate differentiation between stable and unstable grasping regions on the target object, providing a reliable feature basis for subsequent grasping quality assessment and grasping pose prediction.

[0065] Step S202: Determine multiple 3D points in the target 3D point cloud as second candidate points, and determine the target approach direction corresponding to each second candidate point.

[0066] Optionally, in one possible implementation, determining multiple 3D points in the target 3D point cloud as second candidate points may include:

[0067] (1) Perform a feasibility assessment on the high-dimensional features corresponding to each three-dimensional point to obtain an evaluation score for each three-dimensional point; the evaluation score indicates the feasibility of using the three-dimensional point as the grasping center.

[0068] (2) Compare the evaluation score corresponding to each three-dimensional point with the preset score threshold to determine the three-dimensional points whose evaluation scores are greater than the preset score threshold;

[0069] (3) Each three-dimensional point whose evaluation score is greater than the preset score threshold is determined as the second candidate point.

[0070] Specifically, the feasibility assessment is represented as follows:

[0071] $$s_i = \Phi_{\text{feas}}(f_i)$$

[0072] In the formula, $\Phi_{\text{feas}}$ is the grasping feasibility assessment branch, used to evaluate the feasibility of using a 3D point as a grasping center; $s_i$ is the evaluation score, indicating the feasibility of using the 3D point $p_i$ as a grasping center; and $f_i$ represents the high-dimensional features corresponding to the 3D point $p_i$.

[0073] Subsequently, the evaluation score corresponding to each 3D point is compared with a preset score threshold to determine the 3D points whose evaluation scores are greater than the preset score threshold, and each 3D point whose evaluation score is greater than the preset score threshold is determined as the second candidate point.

[0074] In this way, by accurately quantifying the feasibility of grasping each 3D point through the grasping feasibility assessment branch, and combining the score threshold to screen the second candidate point, it helps to reduce the amount of subsequent calculations, thereby improving efficiency, while ensuring the quality of candidate points, and providing reliable support for subsequent grasping pose prediction and grasping success.

[0075] Furthermore, approach direction prediction is performed on the high-dimensional features corresponding to each second candidate point to obtain the grasping fit score of each second candidate point under different approach directions (e.g., under 300 pre-sampled discrete approach directions). The approach direction with the highest grasping fit score is determined as the target approach direction corresponding to the second candidate point. Approach direction prediction is expressed as follows:

[0076] $$S_j = \Phi_{\text{view}}(f_j)$$

[0077] In the formula, $\Phi_{\text{view}}$ is the approach direction prediction function; $S_j$ represents the grasping fit scores of the second candidate point $p_j$ in each approach direction; and $f_j$ represents the high-dimensional features corresponding to the second candidate point.
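Selecting the target approach direction reduces to an argmax over per-direction scores. In the sketch below the 300 directions are sampled with a Fibonacci lattice, which is only one plausible way to pre-sample them, and random numbers stand in for the output of the prediction branch.

```python
import numpy as np

def fibonacci_sphere(n=300):
    """Return n roughly uniform unit vectors on the sphere."""
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n                 # evenly spaced heights in (-1, 1)
    phi = np.pi * (1.0 + 5.0 ** 0.5) * i  # golden-ratio angular increments
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def target_approach_directions(direction_scores, directions):
    """Pick, per candidate point, the direction with the highest fit score.

    direction_scores: (M, V) scores from a hypothetical prediction head.
    directions:       (V, 3) unit vectors.
    """
    best = np.argmax(direction_scores, axis=1)
    return directions[best], best

directions = fibonacci_sphere(300)
rng = np.random.default_rng(0)
scores = rng.random((4, 300))  # stand-in for network output, values in [0, 1)
scores[0, 42] = 2.0            # force a known winner for candidate 0
approach, best = target_approach_directions(scores, directions)
```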

[0078] Step S203: For each second candidate point, construct a neighborhood point set corresponding to each second candidate point in the target 3D point cloud with the second candidate point as the center and the target approach direction corresponding to the second candidate point as the direction axis, and extract the high-dimensional features of each point in the neighborhood point set to obtain a local orientation feature set.

[0079] Specifically, within the target 3D point cloud, the neighborhood point set of each second candidate point is constructed, centered on the second candidate point and with the target approach direction corresponding to the second candidate point as the direction axis: $N_j = \{\, p_k \mid p_k \in \Omega(p_j, v_j) \,\}$. Here, $N_j$ represents the neighborhood point set corresponding to the second candidate point $p_j$; $p_k$ represents each point within the neighborhood point set; and $\Omega(p_j, v_j)$ denotes the directional local spatial neighborhood centered on $p_j$ along the target approach direction $v_j$.

[0080] Furthermore, high-dimensional features are extracted from each point within the neighborhood point set to obtain the local orientation feature set $F_j = \{\, f_k \mid p_k \in N_j \,\}$.
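The application does not specify the shape of the directional neighborhood. The sketch below assumes a cylinder whose axis is the target approach direction and whose center is the candidate point; the radius, half-length, and function name are illustrative.

```python
import numpy as np

def directional_neighborhood(points, center, approach, radius=0.05, half_length=0.04):
    """Indices of points inside a cylinder aligned with the approach direction.

    The cylinder is centered on `center`, extends `half_length` metres in both
    directions along `approach`, and has the given `radius`.
    """
    a = approach / np.linalg.norm(approach)
    rel = points - center
    along = rel @ a                                         # signed distance along the axis
    radial = np.linalg.norm(rel - np.outer(along, a), axis=1)  # distance from the axis
    inside = (np.abs(along) <= half_length) & (radial <= radius)
    return np.nonzero(inside)[0]

points = np.array([
    [0.0, 0.0, 0.0],
    [0.01, 0.0, 0.02],
    [0.2, 0.0, 0.0],
    [0.0, 0.0, 0.1],
])
features = np.arange(8.0).reshape(4, 2)  # stand-in per-point high-dimensional features
idx = directional_neighborhood(points, center=points[0], approach=np.array([0.0, 0.0, 1.0]))
local_features = features[idx]           # local orientation feature set for this candidate
```

Only the first two points fall inside the cylinder: the third is too far from the axis and the fourth is too far along it.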

[0081] Step S204: Perform grasping quality prediction on the local orientation feature set corresponding to each second candidate point to obtain the grasping quality prediction score corresponding to each second candidate point, and select a preset number of second candidate points as first candidate points based on these grasping quality prediction scores.

[0082] In this step, grasping quality prediction is performed on the local orientation feature set corresponding to each second candidate point, as shown below:

[0083] $$q_j = \Phi_{\text{q}}(F_j)$$

[0084] In the formula, $\Phi_{\text{q}}$ is the grasping quality prediction branch; $q_j$ is the grasping quality prediction score corresponding to the second candidate point; and $F_j$ is the local orientation feature set.

[0085] Furthermore, the second candidate points are sorted from highest to lowest according to their grasping quality prediction scores, and a preset number of the top-ranked second candidate points are selected as first candidate points. For example, the 5 top-ranked second candidate points are selected as first candidate points.
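The ranking step is a plain top-k selection. A minimal sketch, with made-up scores and a hypothetical function name:

```python
import numpy as np

def top_k_candidates(quality_scores, k=5):
    """Indices of the k highest-scoring candidates, best first."""
    order = np.argsort(-np.asarray(quality_scores), kind="stable")
    return order[:k]

quality_scores = [0.2, 0.9, 0.5, 0.7, 0.1, 0.8, 0.3]
first_candidates = top_k_candidates(quality_scores, k=5)
```

The stable sort keeps the original order among candidates with equal scores, which makes the selection deterministic.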

[0086] The method provided in this embodiment completes the preliminary screening of candidate points through grasping feasibility assessment, matches the optimal approach direction, and then achieves accurate grasping quality scoring based on local feature extraction of directional neighborhood. This enables the precise elimination of invalid candidate points, ensuring that the first candidate point obtained from the screening fits the target spatial structure and the actual grasping logic of the robotic arm.

[0087] Figure 3 is a flowchart of Embodiment 3 of the robot grasping control method. Referring to Figure 3, in the method provided in this embodiment, which builds on the above embodiments, predicting the candidate grasping pose corresponding to each first candidate point may include the following steps:

[0088] Step S301: Based on the local orientation feature set corresponding to each first candidate point, predict multiple grasping attributes corresponding to each first candidate point; the multiple grasping attributes include in-plane rotation angle, grasping depth and gripper width.

[0089] Specifically, rotation angle prediction is performed on the local orientation feature set corresponding to each first candidate point, as shown below:

[0090]

[0091] In the formula, For the rotation angle prediction branch; The predicted in-plane rotation angle; It is a set of locally oriented features.

[0092] For each first candidate point, the local directional feature set corresponding to the crawling depth is predicted, as shown below:

[0093]

[0094] In the formula, the terms denote, in order: the grasping depth prediction branch; the predicted grasping depth; and the local orientation feature set.

[0095] Furthermore, the gripper width is predicted for the local orientation feature set corresponding to each first candidate point, as detailed below:

[0096]

[0097] In the formula, the terms denote, in order: the gripper width prediction branch; the predicted gripper width; and the local orientation feature set.

[0098] Step S302: Combine each first candidate point, the target approach direction corresponding to each first candidate point, and multiple grasping attributes to obtain the candidate grasping pose corresponding to each first candidate point.

[0099] In practical implementation, the candidate grasping pose corresponding to each first candidate point can be represented as a tuple whose components are, in order: the coordinates of the first candidate point, i.e., the coordinates of the grasp center; the target approach direction; the in-plane rotation angle; the grasping depth; and the gripper width.
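
As an illustration of how the five components named here might be combined into one record, consider the following minimal sketch. The class name `CandidateGraspPose`, the field names, and the units (radians, metres) are assumptions introduced for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class CandidateGraspPose:
    center: Vec3      # grasp center = coordinates of the first candidate point
    approach: Vec3    # target approach direction, stored as a unit vector
    rotation: float   # in-plane rotation angle (radians, assumed unit)
    depth: float      # grasping depth (metres, assumed unit)
    width: float      # gripper width (metres, assumed unit)


def assemble_pose(center: Sequence[float], approach: Sequence[float],
                  rotation: float, depth: float, width: float) -> CandidateGraspPose:
    """Combine a candidate point, its approach direction and the predicted attributes."""
    norm = math.sqrt(sum(c * c for c in approach))
    if norm == 0.0:
        raise ValueError("approach direction must be non-zero")
    unit = (approach[0] / norm, approach[1] / norm, approach[2] / norm)
    return CandidateGraspPose((center[0], center[1], center[2]), unit,
                              rotation, depth, width)


pose = assemble_pose((0.1, 0.2, 0.3), (0.0, 0.0, 2.0), 0.5, 0.02, 0.06)
```

Normalizing the approach direction at construction time is a design choice of this sketch; it keeps later geometric computations independent of the scale of the predicted vector.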

[0100] It should be noted that the predictions of the grasping center, optimal approach direction, and various grasping attributes mentioned above can be achieved based on a pre-built prediction model. During the training phase of the prediction model, a complex dataset covering scenarios such as densely stacked objects, multiple materials, and reflective objects is constructed, and a multi-dimensional composite loss function is built to optimize the model. For example, this composite loss function consists of a weighted average of classification loss, regression loss, and width loss. The classification loss ensures accurate identification of the graspable area, the regression loss minimizes the cosine distance between the predicted pose and the actual grasping direction, and the width loss constrains the fit between the gripper opening and the actual size of the target object. Simultaneously, an online hard example mining mechanism is introduced during model training to increase the training weights of difficult samples such as those with low confidence and those prone to slippage, thereby further enhancing the model's prediction robustness under extreme conditions.
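
The composite loss described in this paragraph can be sketched for a single training sample as follows. The concrete loss forms chosen here (binary cross-entropy for classification, cosine distance for direction regression, absolute error for width) and the equal default weights are assumptions; the text above only states that the three terms are combined as a weighted average.

```python
import math
from typing import Sequence


def binary_cross_entropy(p: float, label: int, eps: float = 1e-7) -> float:
    """Classification term: is this point graspable (label 1) or not (label 0)?"""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Regression term: 1 - cos(angle) between predicted and actual direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def composite_loss(p_graspable: float, label: int,
                   pred_dir: Sequence[float], true_dir: Sequence[float],
                   pred_width: float, true_width: float,
                   weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted average of classification, regression and width terms."""
    w_cls, w_reg, w_wid = weights
    l_cls = binary_cross_entropy(p_graspable, label)
    l_reg = cosine_distance(pred_dir, true_dir)
    l_wid = abs(pred_width - true_width)  # width term (absolute error, assumed)
    return (w_cls * l_cls + w_reg * l_reg + w_wid * l_wid) / (w_cls + w_reg + w_wid)
```

Online hard example mining, as mentioned above, would additionally increase the per-sample weight of difficult samples; that step is omitted from this sketch.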

[0101] The method provided in this embodiment predicts key grasping attributes such as in-plane rotation angle, grasping depth, and gripper width based on a set of local orientation features. This enables precise matching of the gripping shape and spatial constraints of the target object, providing a reliable parameter basis for subsequent pose evaluation and robotic arm execution. Furthermore, it first predicts the candidate grasping center and its optimal approach direction, and then predicts the remaining grasping attributes in parallel under the constraint of this approach direction. This prunes the highly ambiguous pose space into a less ambiguous subspace, which helps improve training stability and reduce computational overhead.

[0102] Figure 4 is a flowchart of Embodiment 4 of the robot grasping control method. Referring to Figure 4, in the method provided in this embodiment, based on the above embodiments, determining the target object region in the two-dimensional environmental image may include the following steps:

[0103] Step S401: Obtain the user grasping command input in real time.

[0104] Specifically, the user grasping command supports multiple real-time input methods, including text input and voice input. Furthermore, the user grasping command can be an explicit command that directly specifies the grasping target, or an implicit scenario command that implies the user's grasping intention. For example, the input user grasping command could be "Please grab the water bottle" or "I'm thirsty".

[0105] Step S402: Input the user's grabbing command and the two-dimensional image of the environment into the visual language model to obtain the target object region in the two-dimensional image of the environment; the visual language model is used to extract the deep target semantics corresponding to the user's grabbing command and the image spatial features corresponding to the two-dimensional image of the environment, perform cross-modal matching between the deep target semantics and the image spatial features, and locate the target object region in the two-dimensional image of the environment based on the matching results.

[0106] It should be noted that the specific processing steps of the Visual Language Model (VLM) include: semantic parsing of the input user grabbing command to extract deep target semantics; feature extraction of the two-dimensional environmental image to extract spatial features of the image, including but not limited to visual information such as color, texture, contour, and spatial position of each object in the image; finally, the model performs cross-modal matching between the extracted deep target semantics and the image spatial features, and through the correspondence between semantics and visual features, it filters out target objects that match the user's grabbing intent, thereby accurately locating the region of the target object in the two-dimensional environmental image, i.e., the target object region.

[0107] The target object region output by the visual language model can be the bounding box information of the target object. Furthermore, the bounding box information of the target object output by the visual language model is the normalized bounding box of the target object, which can then be mapped back to the original image resolution to obtain the target bounding box in the pixel coordinate system.
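
The mapping from a normalized bounding box back to pixel coordinates can be sketched as below. It is assumed here that the box is given as `(x_min, y_min, x_max, y_max)` with values in the range 0 to 1; the actual output convention of a given visual language model may differ (some use a 0 to 1000 scale, for instance), and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def denormalize_bbox(box: Sequence[float], image_width: int,
                     image_height: int) -> Tuple[int, int, int, int]:
    """Map a normalized (x_min, y_min, x_max, y_max) box to pixel coordinates."""

    def to_pixel(value: float, size: int) -> int:
        # Scale, round to the nearest pixel, and clamp to the valid index range.
        return min(max(int(round(value * size)), 0), size - 1)

    x_min, y_min, x_max, y_max = box
    return (to_pixel(x_min, image_width), to_pixel(y_min, image_height),
            to_pixel(x_max, image_width), to_pixel(y_max, image_height))


pixel_box = denormalize_bbox((0.25, 0.5, 0.75, 1.0), 640, 480)
```

Clamping to `size - 1` keeps the box usable as array indices even when the model returns a coordinate of exactly 1.0.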

[0108] The method provided in this embodiment achieves cross-modal accurate matching between the deep semantics of user commands and the spatial features of two-dimensional images of the environment through a visual language model, effectively solving the problem of insufficient semantic understanding in complex open scenes and realizing rapid and accurate localization of target object regions.

[0109] Figure 5 is a flowchart of Embodiment 5 of the robot grasping control method. Referring to Figure 5, in the method provided in this embodiment, based on the above embodiments, performing multimodal evaluation on each candidate grasping pose to obtain the evaluation result corresponding to each candidate grasping pose may include the following steps:

[0110] Step S501: Construct a virtual gripper corresponding to each candidate grasping pose and determine the coordinates of multiple key points on each virtual gripper; the multiple key points include at least the two farthest endpoints on the gripper base along the gripper opening and closing direction, and the tip of each gripper finger.

[0111] Specifically, based on the parameters such as the grasping center, target approach direction, gripper width, and grasping depth contained in each candidate grasping pose, a corresponding virtual gripper is constructed in three-dimensional space according to the structural dimensions of the robotic arm gripper.

[0112] Furthermore, several key points on each virtual gripper are identified that characterize the core contour of the gripper and its gripping position. These include the two farthest endpoints on the gripper base along the opening and closing direction, and the tip of each gripper finger (for example, when the virtual gripper is a standard two-finger gripper, the key points include the tips of the left and right gripper fingers). The two farthest endpoints on the gripper base along the opening and closing direction are used to determine the base boundary and the overall spatial reference, while the tips of the gripper fingers reflect the gripping contact position and the actual opening and closing distance.

[0113] Step S502: Based on the camera intrinsic parameter matrix, the coordinates of multiple key points on each virtual gripper are projected onto the environmental two-dimensional image, and the corresponding grasping wireframe markers are generated based on the projection results to obtain the gripper projection image corresponding to each virtual gripper; the camera intrinsic parameter matrix is the intrinsic parameter matrix corresponding to the camera used to acquire the environmental two-dimensional image.

[0114] Specifically, based on the camera intrinsic parameter matrix of the acquired 2D environmental image, the coordinates of each key point on each virtual gripper are projected onto the pixel coordinate system of the 2D environmental image using a pinhole camera model. The specific projection calculation formula is as follows:

[0115] $$ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} x \\ y \\ z \end{bmatrix} $$

[0116] In the formula, s is the scale factor (i.e., depth z); (u,v) represents the pixel coordinates in the 2D image of the environment; K represents the camera intrinsic parameter matrix; and (x,y,z) represents the coordinates of the key points.
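
The pinhole projection with the symbols defined here, s·[u, v, 1]ᵀ = K·[x, y, z]ᵀ, can be sketched as follows. The function name and the example intrinsic values are assumptions; the key point is expected to be expressed in the camera coordinate frame.

```python
from typing import Sequence, Tuple


def project_point(K: Sequence[Sequence[float]],
                  point: Sequence[float]) -> Tuple[float, float]:
    """Project a 3D key point (x, y, z) in the camera frame to pixel coordinates (u, v)."""
    # Homogeneous image coordinates: K @ [x, y, z]^T.
    h = [sum(K[r][c] * point[c] for c in range(3)) for r in range(3)]
    s = h[2]  # scale factor; equals the depth z when the last row of K is (0, 0, 1)
    if s <= 0:
        raise ValueError("point must lie in front of the camera")
    return h[0] / s, h[1] / s


# Hypothetical intrinsics: focal lengths 600 px, principal point (320, 240).
K = [[600.0, 0.0, 320.0],
     [0.0, 600.0, 240.0],
     [0.0, 0.0, 1.0]]
u, v = project_point(K, (0.1, -0.05, 0.5))
```

Applying this function to each key point of a virtual gripper yields the pixel positions that are then connected into the grasping wireframe markers.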

[0117] Furthermore, based on the actual structural relationship of the grippers, lines are drawn connecting the key points after projection to generate corresponding gripping wireframe markers. Finally, the gripping wireframe markers are fused with the 2D image of the environment to obtain the gripper projection image corresponding to each virtual gripper, and a unique ID number is assigned to each gripper projection image. The gripping wireframe markers can intuitively reflect information such as the gripping position, the gripper opening and closing range, and the approach direction of the gripper.

[0118] In this way, the candidate grasping pose is transformed into a visual image symbol with geometric semantics, providing an intuitive image-level basis for subsequent evaluation of the grasping rationality.

[0119] For example, as shown in Figure 6, the virtual gripper constructed in three-dimensional space in part A) of Figure 6 is projected, and the corresponding grasping wireframe markers are generated based on the projection results. The two-dimensional environmental image with the grasping wireframe markers is shown in part B) of Figure 6, where the green lines represent the gripper base, the red lines represent the gripper fingers, and the blue lines with arrows represent the approach vector (i.e., the approach direction).

[0120] Step S503: Determine the 6D pose parameters corresponding to each virtual gripper based on the candidate grasping pose corresponding to each virtual gripper.

[0121] It should be noted that each candidate grasping pose includes parameters such as the grasp center, target approach direction, in-plane rotation angle, gripper width, and grasping depth. In this step, these parameters are combined and converted into the standard 6D pose parameters used in the field of robotic arm control. Of the six degrees of freedom, three are the three-dimensional spatial coordinates of the grasp center, and the other three are rotational parameters characterizing the spatial orientation of the gripper.

[0122] Step S504: For each virtual gripper, based on the gripper projection image corresponding to the virtual gripper, the 6D pose parameters corresponding to the virtual gripper, the preset joint limit threshold of the robot's robotic arm, and the user's gripping command, perform multimodal evaluation on the candidate gripping pose to obtain the evaluation result corresponding to the candidate gripping pose.

[0123] For example, in one embodiment, a first evaluation sub-item is constructed based on the gripper projection image corresponding to the virtual gripper; a second evaluation sub-item is constructed based on the 6D pose parameters corresponding to the virtual gripper and the preset joint limit threshold of the robot's robotic arm; a third evaluation sub-item is constructed based on the gripper projection image corresponding to the virtual gripper and the user's grasping command; a multi-dimensional joint evaluation function is constructed based on the first evaluation sub-item, the second evaluation sub-item, and the third evaluation sub-item; and a multi-modal evaluation is performed on the candidate grasping pose based on the multi-dimensional joint evaluation function.

[0124] In practical implementation, a multi-dimensional joint evaluation function can be constructed by pre-setting weights and fusing them. The specific expression is as follows:

[0125]

[0126] In the formula, the three weighting terms are the dynamic weighting coefficients of the respective evaluation sub-items; R represents the 6D pose parameters corresponding to the virtual gripper; I represents the preset joint limit threshold of the robot's robotic arm; T represents the gripper projection image corresponding to the virtual gripper; and T represents the user's grasping command. The three remaining terms are the first evaluation sub-item, the second evaluation sub-item, and the third evaluation sub-item, respectively.

[0127] The first evaluation sub-item is scored by cross-validating visual features against numerical parameters. The specific judgment rules include, but are not limited to: (1) detecting whether the grasping wireframe markers in the gripper projection image overlap severely or collapse into a cluster; if so, the grasping pose lies in a kinematically degenerate direction, and a low score is given; (2) calculating the joint rotation angles of the robotic arm according to the 6D pose parameters corresponding to the virtual gripper; if a joint rotation angle approaches the preset joint limit threshold, the pose is judged to risk exceeding the motion limit, and a low score is given or the candidate grasping pose is eliminated directly; (3) verifying whether the approach vector (i.e., the approach direction) indicated by the grasping wireframe markers in the gripper projection image is a normal movement direction of the robotic arm; if not, a low score is given or the candidate grasping pose is eliminated.

The second evaluation sub-item evaluates stability by analyzing the perspective relationships and geometric features of the gripper projection image. The specific judgment rules include, but are not limited to: (1) for elongated objects, judging whether the gripper base line in the grasping wireframe marker is parallel to the short axis of the object; if not, a low score is given. For circular objects, judging whether the grasp center in the grasping wireframe marker is aligned with the geometric center of the object; if not (for example, the grasp center is close to an edge or tip), a low score is given; (2) judging whether the finger projection line in the grasping wireframe marker is shorter than the gripper base line; if the finger projection line is longer than the gripper base line, the candidate grasping pose is judged to be an oblique tangential grasp with poor stability, and a low score is given; (3) detecting whether the finger projection lines visually straddle the target object (i.e., whether they are distributed on both sides of the target object and do not intersect the outline of the target object at the pixel level); if this condition is not met, a low score is given.

The third evaluation sub-item assesses the degree of semantic match between the candidate grasping pose and the user's grasping command. Specifically, the part of the target object marked by the grasping wireframe in the gripper projection image is identified, and the degree of semantic match between this part and the user's grasping command is evaluated. For example, if the user's grasping command is "hand me the cup" and the marked part of the target object is the cup handle, a high score is assigned; if the user's grasping command is "hand me the cup" and the marked part is, for example, the rim of the cup, the candidate grasping pose is judged to be inconsistent with human interaction habits and is assigned a low score.

[0128] In practice, the individual judgment results within each evaluation sub-item are weighted and fused to obtain the comprehensive score for the corresponding evaluation sub-item; then, the scores of the three evaluation sub-items are weighted and calculated to obtain the total multimodal evaluation score for the candidate grasping pose. Furthermore, after obtaining the total multimodal evaluation score for each candidate grasping pose, the candidate grasping pose with the highest total multimodal evaluation score is selected as the optimal grasping pose.
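
The weighted fusion and selection described in this paragraph can be sketched as follows. The example weights, the use of `None` to mark a pose that a rule eliminated outright, and the function names are assumptions made for illustration; the sub-item scores themselves would come from the evaluation described above.

```python
from typing import Dict, Optional, Sequence, Tuple

SubScores = Tuple[float, float, float]


def total_score(sub_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the three evaluation sub-item scores."""
    return sum(w * s for w, s in zip(weights, sub_scores))


def select_optimal(candidates: Dict[str, Optional[SubScores]],
                   weights: Sequence[float]) -> Optional[str]:
    """Return the ID of the highest-scoring candidate pose, or None if all were eliminated."""
    best_id, best = None, float("-inf")
    for pose_id, sub_scores in candidates.items():
        if sub_scores is None:  # pose eliminated directly by a judgment rule
            continue
        score = total_score(sub_scores, weights)
        if score > best:
            best_id, best = pose_id, score
    return best_id


# Hypothetical sub-item scores keyed by gripper projection image ID.
candidates = {"g1": (0.9, 0.2, 0.8), "g2": (0.7, 0.8, 0.6), "g3": None}
best = select_optimal(candidates, weights=(0.4, 0.4, 0.2))
```

Keying the candidates by the unique ID assigned to each gripper projection image makes it straightforward to map the winning score back to its grasping pose.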

[0129] It should be noted that the above multimodal evaluation process can be uniformly implemented by a visual language model, that is, the gripper projection image corresponding to the virtual gripper, the 6D pose parameters corresponding to the virtual gripper, the preset joint limit threshold of the robot's robotic arm, and the user's grasping command are input into the visual language model, and the above multimodal evaluation is performed on the candidate grasping poses through the visual language model.

[0130] The method provided in this embodiment transforms candidate grasping poses into visual image symbols with geometric semantics by projecting them from three dimensions to two dimensions. It then performs a multimodal joint evaluation along three dimensions: rationality of the visual projection, feasibility under the motion constraints of the robotic arm, and degree of match with the user's grasping intent. This allows the safety, feasibility, and adaptability of a pose to be assessed before grasping, helping to avoid risks such as joint over-limit, motion interference, and unreasonable postures. It also keeps the grasping scheme aligned with the user's intent and the actual scene, improving the reliability of the optimal grasping pose and thereby the safety and reachability of the grasping action.

[0131] Furthermore, the aforementioned evaluation method enables robots in unstructured environments to make grasping decisions that conform to both spatial geometric constraints and task-specific semantic logic, supporting a high success rate in subsequent complex human-robot interaction and mobile manipulation tasks. It is particularly suited to applications that place high demands on the correctness of grasping logic, physical stability, and fault tolerance, such as open-vocabulary item sorting and household service robots.

[0132] Figure 7 is a flowchart of Embodiment 6 of the robot grasping control method. Referring to Figure 7, in the method provided in this embodiment, based on the above embodiments, the following steps may be included after controlling the robot to perform the corresponding grasping action according to the optimal grasping pose:

[0133] Step S701: Generate corresponding verification prompt words according to the user's grasping command; the verification prompt words are used to instruct the visual language model to perform verification tasks on the robot's gripper image; the verification tasks include determining whether there is an object in the gripper and whether the object in the gripper is the target object corresponding to the user's grasping command.

[0134] For example, when the user's grab command is "Give me a spoon", the generated verification prompt is "Please determine whether there is an object in the gripper of the current gripper image. If there is an object, identify whether the object in the gripper is a spoon".

[0135] Step S702: Acquire the current gripper image of the robot, and input the current gripper image and verification prompt words into the visual language model to obtain the verification result corresponding to the current gripper image.

[0136] It should be noted that the verification results include the judgment result of whether there is an object in the gripper, and the recognition result of whether the object in the gripper is the target object corresponding to the user's grabbing command.

[0137] In step S703, when the verification result indicates that there is an object in the gripper in the current gripper image, and the object in the gripper is the target object corresponding to the user's grasping command, the robot is determined to have successfully grasped the object, and the robot is controlled to perform a return-to-zero action.

[0138] Specifically, if the verification result indicates that there is an object in the gripper of the current gripper image, and the object in the gripper is the target object corresponding to the user's grasping command, then the robot is determined to have successfully grasped the object, and the robot's robotic arm is controlled to perform a return-to-zero action.

[0139] Step S704: If the verification result indicates that there is no object in the gripper in the current gripper image, or if there is an object in the gripper in the current gripper image but the object in the gripper is not the target object corresponding to the user's grasping command, the robot is determined to have failed to grasp. A grasping failure diagnosis result is generated based on the current gripper image so as to dynamically adjust the previous grasping strategy based on the diagnosis result.

[0140] In this step, if the verification result indicates that there is no object in the gripper of the current gripper image, or if there is an object in the gripper of the current gripper image, but the object in the gripper is not the target object corresponding to the user's grasping command, then the robot grasping is determined to have failed, and a grasping anomaly type label corresponding to the verification result is output.
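
Steps S701 to S704 can be sketched as a prompt builder plus a small decision function. The prompt wording follows the example given above; the anomaly labels `empty_grasp` and `wrong_object`, the function names, and the assumption that the visual language model's answer has already been parsed into two booleans are all hypothetical.

```python
from typing import Tuple


def build_verification_prompt(target_name: str) -> str:
    """Build the verification prompt for the target named in the user's grasping command."""
    return (
        "Please determine whether there is an object in the gripper of the "
        "current gripper image. If there is an object, identify whether the "
        f"object in the gripper is {target_name}."
    )


def classify_verification(has_object: bool, is_target: bool) -> Tuple[bool, str]:
    """Map the verification result to (success, label)."""
    if not has_object:
        return False, "empty_grasp"    # nothing in the gripper
    if not is_target:
        return False, "wrong_object"   # an object was grasped, but not the target
    return True, "success"             # proceed to the return-to-zero action


prompt = build_verification_prompt("a spoon")
success, label = classify_verification(has_object=True, is_target=False)
```

Returning a label alongside the success flag gives the subsequent diagnosis step a starting point for adjusting the grasping strategy.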

[0141] At this point, the robotic arm's execution is halted, triggering a cognitive retry mechanism led by a visual language model. Specifically, the visual language model diagnoses the grasping failure of the current gripper image, obtaining a diagnosis result. Based on this diagnosis, the previous grasping strategy is dynamically modified and adjusted. For example, if the diagnosis result indicates a grasping pose deviation leading to an empty grasp (i.e., no object is present in the gripper), the accuracy of the visual language model's field-of-view centering determination is improved in the subsequent pre-grasping stage. If the diagnosis result indicates a poor grasping center selection leading to slippage, the corresponding penalty weight is fed back to the grasping decision optimization module to eliminate the unreasonable candidate pose and select the next highest-ranked and more stable grasping pose.

[0142] After dynamic adjustment, the robotic arm opens its gripper and retracts to the pre-grabbing safe height, and re-executes the aforementioned perception detection, pose decision-making, and control execution processes until the visual language model verifies successful grasping, thus forming a complete embodied intelligent error correction closed loop.
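
The retry loop can be sketched as below. This is a simplified illustration: it only falls back to the next-ranked candidate pose after each failure, whereas the process described above may also re-run perception and pose decision. The callback names, the attempt limit, and the use of string pose IDs are assumptions.

```python
from typing import Callable, List, Optional


def grasp_with_retry(ranked_poses: List[str],
                     execute: Callable[[str], None],
                     verify: Callable[[], bool],
                     retreat: Callable[[], None],
                     max_attempts: int = 3) -> Optional[str]:
    """Try candidate poses in ranked order until verification succeeds.

    Returns the pose that succeeded, or None if every attempt failed.
    """
    for pose in ranked_poses[:max_attempts]:
        execute(pose)     # control the arm to grasp with this pose
        if verify():      # verification by the visual language model
            return pose
        retreat()         # open the gripper and return to the safe height
    return None


# Stub callbacks standing in for the real robot and model interfaces.
log: List[str] = []
outcomes = iter([False, True])
chosen = grasp_with_retry(["p1", "p2", "p3"],
                          execute=log.append,
                          verify=lambda: next(outcomes),
                          retreat=lambda: log.append("retreat"))
```

Bounding the number of attempts is a design choice of this sketch; it prevents the loop from running indefinitely when no candidate pose can succeed.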

[0143] The method provided in this embodiment deeply integrates a visual language model into the entire process of perception, decision-making, execution, and anomaly verification, constructing a complete closed loop with autonomous diagnosis and correction capabilities. This significantly improves the grasping success rate and the robustness of the system in complex, unstructured environments. At the same time, this closed-loop grasping scheme effectively overcomes problems such as fragmented task chains, insufficient semantic understanding, and mismatch between perception and execution, achieving end-to-end collaborative adaptation of user intent understanding, visual positioning, and physical execution.

[0144] The present embodiment will now be described and illustrated through preferred embodiments.

[0145] Figure 8 is a flowchart of Embodiment 7 of the robot grasping control method. As shown in Figure 8, the robot grasping control method includes the following steps:

[0146] Step S801: Obtain the real-time input user grabbing command and the real-time acquired environmental image; the environmental image includes a two-dimensional environmental image and an environmental depth image, and input the user grabbing command and the two-dimensional environmental image into the visual language model to obtain the target object region in the two-dimensional environmental image.

[0147] Step S802: Generate a target two-dimensional mask corresponding to the target object region, and determine the depth information corresponding to each pixel in the target two-dimensional mask according to the environmental depth image. Based on the depth information corresponding to each pixel and the camera intrinsic parameter matrix, back-project each pixel in the target two-dimensional mask into a corresponding three-dimensional point, and combine the three-dimensional points to obtain the target three-dimensional point cloud.
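
The back-projection in step S802 is the inverse of the pinhole projection and can be sketched as follows. It is assumed here that the intrinsic matrix has zero skew, that the depth image is indexed as `depth[v][u]` in the same units as the desired point cloud, and that a depth of zero or less marks an invalid measurement; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def back_project(mask_pixels: Sequence[Tuple[int, int]],
                 depth: Sequence[Sequence[float]],
                 K: Sequence[Sequence[float]]) -> List[Point3]:
    """Back-project masked pixels (u, v) with depth into 3D points in the camera frame."""
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    points: List[Point3] = []
    for u, v in mask_pixels:
        z = depth[v][u]
        if z <= 0:  # skip pixels without a valid depth measurement
            continue
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points.append((x, y, z))
    return points


# Tiny hypothetical example: 3x3 depth image, focal length 2 px, principal point (1, 1).
K = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]
depth = [[0.0, 1.0, 1.0],
         [1.0, 2.0, 4.0],
         [1.0, 1.0, 1.0]]
cloud = back_project([(1, 1), (2, 1), (0, 0)], depth, K)
```

Restricting the loop to the pixels of the target two-dimensional mask is what limits the resulting point cloud to the target object rather than the whole scene.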

[0148] Step S803: Extract features from the target 3D point cloud to obtain the high-dimensional features corresponding to each 3D point in the target 3D point cloud, perform grasping feasibility evaluation on the high-dimensional features corresponding to each 3D point to obtain the evaluation score corresponding to each 3D point, and determine the 3D points with evaluation scores greater than the preset score threshold as the second candidate points.

[0149] Step S804: For each second candidate point, construct a neighborhood point set corresponding to each second candidate point in the target 3D point cloud with the second candidate point as the center and the target approach direction corresponding to the second candidate point as the direction axis, and extract the high-dimensional features of each point in the neighborhood point set to obtain a local orientation feature set.
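
One way to build a neighborhood oriented along the approach direction is a cylindrical crop, sketched below. The cylinder shape, its being centered on the candidate point, and the `radius` and `height` parameters are assumptions; the step above only requires that the candidate point be the center and the approach direction be the direction axis.

```python
import math
from typing import List, Sequence


def cylinder_neighborhood(points: Sequence[Sequence[float]],
                          center: Sequence[float],
                          axis: Sequence[float],
                          radius: float, height: float) -> List[int]:
    """Return indices of points inside a cylinder centered at `center` along `axis`."""
    n = math.sqrt(sum(c * c for c in axis))
    if n == 0.0:
        raise ValueError("axis must be non-zero")
    a = [c / n for c in axis]
    selected: List[int] = []
    for i, p in enumerate(points):
        d = [p[k] - center[k] for k in range(3)]
        t = sum(d[k] * a[k] for k in range(3))            # signed offset along the axis
        radial_sq = sum(d[k] * d[k] for k in range(3)) - t * t
        if abs(t) <= height / 2 and radial_sq <= radius * radius:
            selected.append(i)
    return selected


# Hypothetical point cloud (metres) and a candidate at the origin approaching along +z.
pts = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.04), (0.03, 0.0, 0.0), (0.0, 0.0, 0.2)]
idx = cylinder_neighborhood(pts, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
                            radius=0.02, height=0.1)
```

The high-dimensional features of the selected points would then be gathered by these indices to form the local orientation feature set.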

[0150] Step S805: Perform grasp quality prediction on the local orientation feature set corresponding to each second candidate point to obtain a grasp quality prediction score for each second candidate point, and select a preset number of second candidate points as first candidate points based on the grasp quality prediction scores of the second candidate points.

[0151] Step S806: Based on the local orientation feature set corresponding to each first candidate point, predict multiple grasping attributes corresponding to each first candidate point; the multiple grasping attributes include in-plane rotation angle, grasping depth and gripper width; and combine each first candidate point, the target approach direction corresponding to each first candidate point and the multiple grasping attributes to obtain the candidate grasping pose corresponding to each first candidate point.

[0152] Step S807: Construct a virtual gripper corresponding to each candidate grasping pose and determine the coordinates of multiple key points on each virtual gripper; the multiple key points include at least the two farthest endpoints on the gripper base along the gripper opening and closing direction, and the tip of each gripper finger.

[0153] Step S808: Based on the camera intrinsic parameter matrix, project the coordinates of multiple key points on each virtual gripper onto the environmental 2D image, and generate corresponding grasping wireframe markers based on the projection results to obtain the gripper projection image corresponding to each virtual gripper; the camera intrinsic parameter matrix is the intrinsic parameter matrix corresponding to the camera used to acquire the environmental 2D image.

[0154] Step S809: Based on the candidate grasping pose corresponding to each virtual gripper, determine the 6D pose parameters corresponding to each virtual gripper, and perform multimodal evaluation on the candidate grasping poses based on the gripper projection image corresponding to the virtual gripper, the 6D pose parameters corresponding to the virtual gripper, the preset joint limit threshold of the robot's robotic arm, and the user's grasping command, to obtain the evaluation results corresponding to the candidate grasping poses.

[0155] Step S810: Based on the evaluation results of each candidate grasping pose, determine the optimal grasping pose among the candidate grasping poses, and control the robot to perform the corresponding grasping action based on the optimal grasping pose.

[0156] Step S811: Generate corresponding verification prompts based on user gripping instructions, collect the current gripper image of the robot, and input the current gripper image and verification prompts into the visual language model to obtain the verification result corresponding to the current gripper image.

[0157] In step S812, when the verification result indicates that there is an object in the gripper in the current gripper image, and the object in the gripper is the target object corresponding to the user's grasping command, the robot is determined to have successfully grasped the object, and the robot is controlled to perform a return-to-zero action.

[0158] In step S813, if the verification result indicates that there is no object in the gripper in the current gripper image, or if the object in the gripper is not the target object corresponding to the user's grasping command, the robot grasping is determined to have failed, and a cognitive retry mechanism dominated by the visual language model is triggered.

[0159] It should be noted that the steps shown in the above process or in the flowchart of the accompanying figures can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.

[0160] This embodiment also provides a robot grasping control device for implementing the above embodiments and preferred embodiments; details already described will not be repeated. The terms "module," "unit," "subunit," etc., used below refer to combinations of software and / or hardware that implement a predetermined function. Although the device described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.

[0161] Figure 9 is a structural block diagram of the robot grasping control device in this embodiment. As shown in Figure 9, the device includes:

[0162] The acquisition module 10 is used to acquire real-time environmental images; the environmental images include two-dimensional environmental images and environmental depth images.

[0163] The generation module 20 is used to determine the target object region in the two-dimensional environmental image and generate the target three-dimensional point cloud corresponding to the target object region based on the environmental depth image.

[0164] The prediction module 30 is used to determine multiple first candidate points in the target 3D point cloud and predict the candidate grasping pose corresponding to each first candidate point;

[0165] The evaluation module 40 is used to perform multimodal evaluation on each candidate grasping pose, obtain the evaluation result corresponding to each candidate grasping pose, and determine the optimal grasping pose among each candidate grasping pose based on the evaluation result corresponding to each candidate grasping pose.

[0166] The control module 50 is used to control the robot to perform the corresponding grasping action according to the optimal grasping pose.

[0167] Optionally, in one possible implementation, the prediction module 30 is specifically used for:

[0168] Feature extraction is performed on the target 3D point cloud to obtain the high-dimensional features corresponding to each 3D point in the target 3D point cloud;

[0169] Multiple 3D points in the target 3D point cloud are identified as second candidate points, and the target approach direction corresponding to each second candidate point is determined.

[0170] For each second candidate point, a neighborhood point set corresponding to each second candidate point is constructed within the target 3D point cloud, with the second candidate point as the center and the target approach direction corresponding to the second candidate point as the direction axis. High-dimensional features of each point in the neighborhood point set are extracted to obtain a local orientation feature set.

[0171] For each second candidate point, perform a grasp quality prediction on the local orientation feature set corresponding to the second candidate point to obtain a grasp quality prediction score for that second candidate point. Based on the grasp quality prediction scores of the second candidate points, select a preset number of second candidate points as first candidate points.
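The neighborhood construction and candidate selection described above can be sketched as follows. This is only an illustration under stated assumptions: the neighborhood is taken to be a cylinder centered on the second candidate point with its axis along the target approach direction, and the "preset number" of candidates is chosen as the top-k scores. The cylinder shape, the `radius` and `height` parameters, and both function names are hypothetical; the application itself only states the center and the direction axis.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def cylinder_neighborhood(
    cloud: Sequence[Vec3],
    center: Vec3,
    approach: Vec3,
    radius: float,
    height: float,
) -> List[int]:
    """Indices of cloud points inside a cylinder centered on `center`.

    The cylinder axis is the (normalized) approach direction; points are kept
    when they lie within height/2 of the center along the axis and within
    `radius` of the axis.
    """
    norm = math.sqrt(sum(a * a for a in approach))
    axis = tuple(a / norm for a in approach)
    indices: List[int] = []
    for i, p in enumerate(cloud):
        d = tuple(pi - ci for pi, ci in zip(p, center))
        along = sum(di * ai for di, ai in zip(d, axis))     # signed distance along the axis
        radial_sq = sum(di * di for di in d) - along * along  # squared distance from the axis
        if abs(along) <= height / 2.0 and radial_sq <= radius * radius:
            indices.append(i)
    return indices


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest grasp quality scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


cloud = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.01), (0.05, 0.0, 0.0), (0.0, 0.0, 0.5)]
neighbors = cylinder_neighborhood(cloud, (0.0, 0.0, 0.0), (0.0, 0.0, 2.0), radius=0.02, height=0.04)
first_candidates = select_top_k([0.2, 0.9, 0.5], k=2)
```

In a full pipeline the high-dimensional features of the points at `neighbors` would form the local orientation feature set that is passed to the grasp quality predictor.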

[0172] Optionally, in one possible implementation, the prediction module 30 is specifically used for:

[0173] Sparse convolution is performed on each 3D point in the target 3D point cloud and the neighboring points of each 3D point to obtain the local geometric features corresponding to each 3D point;

[0174] An encoder-decoder structure is used to aggregate the local geometric features corresponding to each 3D point at multiple scales to obtain multi-scale context features.

[0175] The multi-scale context features are mapped point-by-point to obtain a high-dimensional feature tensor that is aligned with the spatial resolution of the target 3D point cloud; the high-dimensional feature tensor includes the high-dimensional features corresponding to each 3D point.

[0176] Optionally, in one possible implementation, the prediction module 30 is specifically used for:

[0177] Grasp feasibility is evaluated on the high-dimensional features corresponding to each 3D point, and an evaluation score is obtained for each 3D point. The evaluation score indicates how feasible it is to use the 3D point as a grasp center.

[0178] The evaluation score corresponding to each 3D point is compared with a preset score threshold to determine the 3D points whose evaluation scores are greater than the preset score threshold.

[0179] Each 3D point whose evaluation score is greater than the preset score threshold is identified as the second candidate point.
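A minimal sketch of this threshold step is shown below. The function name and the sample scores are hypothetical; the only detail taken from the text is that a point qualifies when its score is strictly greater than the preset threshold.

```python
from typing import List, Sequence


def select_second_candidates(scores: Sequence[float], threshold: float) -> List[int]:
    """Return indices of 3D points whose evaluation score strictly exceeds the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]


point_scores = [0.10, 0.72, 0.50, 0.91]
second_candidates = select_second_candidates(point_scores, threshold=0.50)
```

Note that a point whose score equals the threshold is not selected, which matches the "greater than" wording above.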

[0180] Optionally, in one possible implementation, the prediction module 30 is specifically used for:

[0181] Based on the local orientation feature set corresponding to each first candidate point, predict multiple grasping attributes corresponding to each first candidate point; the multiple grasping attributes include in-plane rotation angle, grasping depth and gripper width;

[0182] The candidate grasping pose corresponding to each first candidate point is obtained by combining each first candidate point, the target approach direction corresponding to each first candidate point, and multiple grasping attributes.
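One way to picture how a candidate grasping pose can be assembled from these quantities is sketched below. The frame convention is an assumption made for illustration only: the gripper x-axis is the approach direction, the y-axis is the finger closing direction, and the in-plane rotation angle rotates the closing direction about the approach axis. The `GraspPose` class, `compose_grasp_pose`, and the helper functions are hypothetical names, not taken from this application.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _normalize(v: Vec3) -> Vec3:
    n = math.sqrt(sum(c * c for c in v))
    return (v[0] / n, v[1] / n, v[2] / n)


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


@dataclass(frozen=True)
class GraspPose:
    position: Vec3  # first candidate point (grasp center)
    x_axis: Vec3    # approach direction (assumed convention)
    y_axis: Vec3    # finger closing direction (assumed convention)
    z_axis: Vec3    # completes the right-handed frame
    depth: float    # grasping depth
    width: float    # gripper width


def compose_grasp_pose(point: Vec3, approach: Vec3, angle: float, depth: float, width: float) -> GraspPose:
    """Combine a candidate point, its approach direction, and the predicted attributes."""
    x = _normalize(approach)
    # Pick any helper vector that is not parallel to the approach direction.
    helper = (0.0, 0.0, 1.0) if abs(x[2]) < 0.9 else (1.0, 0.0, 0.0)
    y0 = _normalize(_cross(helper, x))  # reference closing direction
    z0 = _cross(x, y0)
    c, s = math.cos(angle), math.sin(angle)
    # Rotate the reference closing direction about the approach axis by `angle`.
    y = (c * y0[0] + s * z0[0], c * y0[1] + s * z0[1], c * y0[2] + s * z0[2])
    return GraspPose(point, x, y, _cross(x, y), depth, width)


pose = compose_grasp_pose((0.1, 0.0, 0.5), (0.0, 0.0, -1.0), angle=math.pi / 2, depth=0.02, width=0.08)
```

The three axes form an orthonormal frame, so together with `position` they give the 6D pose parameters of the gripper, while `depth` and `width` are carried alongside.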

[0183] Optionally, in one possible implementation, the generation module 20 is specifically used for:

[0184] Acquire the user's grasping command in real time;

[0185] The user's grasping command and the two-dimensional environmental image are input into the visual language model to obtain the target object region in the two-dimensional environmental image. The visual language model is used to extract the deep target semantics corresponding to the user's grasping command and the image spatial features corresponding to the two-dimensional environmental image; the deep target semantics and the image spatial features are then matched cross-modally, and the target object region in the two-dimensional environmental image is located based on the matching result.

[0186] Optionally, in one possible implementation, the evaluation module 40 is specifically used for:

[0187] Construct a virtual gripper corresponding to each candidate grasping pose and determine the coordinates of multiple key points on each virtual gripper; the multiple key points include at least the two farthest endpoints on the gripper base along the gripper opening and closing direction, and the tip of each gripper finger;

[0188] Based on the camera intrinsic parameter matrix, the coordinates of multiple key points on each virtual gripper are projected onto the 2D image of the environment, and the corresponding gripping wireframe markers are generated based on the projection results to obtain the gripper projection image corresponding to each virtual gripper; the camera intrinsic parameter matrix is the intrinsic parameter matrix of the camera used to acquire the 2D image of the environment;

[0189] Based on the candidate grasping pose corresponding to each virtual gripper, determine the 6D pose parameters corresponding to each virtual gripper.

[0190] For each virtual gripper, a multimodal evaluation is performed on the candidate grasping pose based on the gripper projection image corresponding to the virtual gripper, the 6D pose parameters corresponding to the virtual gripper, the preset joint limit threshold of the robot's robotic arm, and the user's grasping command, so as to obtain the evaluation result corresponding to the candidate grasping pose.
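The key point construction and projection steps can be sketched as follows. The gripper geometry here is a hypothetical simplification: a two-finger gripper whose fingertips sit `depth` beyond the grasp center along the approach direction, with the base bar one `finger_length` behind the tips. The sketch also assumes that the key points are already expressed in the camera frame and that the intrinsic matrix has zero skew. All function names and numeric values are illustrative only.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[float, float]

# Key point order: base_left, base_right, tip_left, tip_right.
WIREFRAME_EDGES = [(0, 1), (0, 2), (1, 3)]  # base bar plus the two fingers


def gripper_keypoints(center: Vec3, approach: Vec3, closing: Vec3,
                      width: float, depth: float, finger_length: float) -> List[Vec3]:
    """Key points of a simplified two-finger virtual gripper (camera frame)."""
    half = width / 2.0
    base = depth - finger_length  # base bar position along the approach axis

    def offset(along: float, side: float) -> Vec3:
        x, y, z = (c + along * a + side * b for c, a, b in zip(center, approach, closing))
        return (x, y, z)

    return [offset(base, -half), offset(base, half), offset(depth, -half), offset(depth, half)]


def project_points(points: Sequence[Vec3], K: Sequence[Sequence[float]]) -> List[Pixel]:
    """Pinhole projection u = fx*X/Z + cx, v = fy*Y/Z + cy using the intrinsic matrix K."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    pixels: List[Pixel] = []
    for x, y, z in points:
        if z <= 0.0:
            raise ValueError("key point lies behind the camera")
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


def wireframe_segments(pixels: Sequence[Pixel]) -> List[Tuple[Pixel, Pixel]]:
    """2D line segments to overlay on the environment image as the wireframe marker."""
    return [(pixels[i], pixels[j]) for i, j in WIREFRAME_EDGES]


K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
keypoints = gripper_keypoints((0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0),
                              width=0.1, depth=0.02, finger_length=0.06)
pixels = project_points(keypoints, K)
segments = wireframe_segments(pixels)
```

Drawing `segments` onto a copy of the two-dimensional environmental image would yield a gripper projection image of the kind described above; the drawing step is omitted to keep the sketch free of imaging libraries.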

[0191] Optionally, in one possible implementation, the evaluation module 40 is specifically used to: construct a first evaluation sub-item based on the gripper projection image corresponding to the virtual gripper;

[0192] Based on the 6D pose parameters corresponding to the virtual gripper and the preset joint limit threshold of the robot's robotic arm, a second evaluation sub-item is constructed.

[0193] Based on the gripper projection image corresponding to the virtual gripper and the user's grasping command, a third evaluation sub-item is constructed;

[0194] Based on the first evaluation sub-item, the second evaluation sub-item, and the third evaluation sub-item, a multi-dimensional joint evaluation function is constructed;

[0195] Based on the multi-dimensional joint evaluation function, the candidate grasping pose is evaluated in a multimodal manner.
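For illustration, a multi-dimensional joint evaluation function could take the form of a weighted sum of the three sub-items, with the optimal grasping pose being the candidate that maximizes it. The weighted-sum form, the weights, the score range of 0 to 1, and all names below are assumptions for this sketch; how each sub-item is actually scored is outside its scope.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SubScores:
    visual: float     # first sub-item: from the gripper projection image
    kinematic: float  # second sub-item: from the 6D pose parameters and joint limit threshold
    semantic: float   # third sub-item: from the projection image and the user's grasping command


def joint_score(s: SubScores, weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted sum of the three sub-items (assumed form of the joint function)."""
    w1, w2, w3 = weights
    return w1 * s.visual + w2 * s.kinematic + w3 * s.semantic


def best_pose_index(candidates: Sequence[SubScores],
                    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> int:
    """Index of the candidate grasping pose with the highest joint score."""
    return max(range(len(candidates)), key=lambda i: joint_score(candidates[i], weights))


candidates = [
    SubScores(visual=0.9, kinematic=0.2, semantic=0.5),
    SubScores(visual=0.7, kinematic=0.8, semantic=0.9),
    SubScores(visual=0.1, kinematic=1.0, semantic=0.1),
]
best = best_pose_index(candidates)
```

A weighted sum lets a candidate that is acceptable on all three sub-items outrank one that excels on a single sub-item, which is the behaviour seen for the second candidate in this toy example.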

[0196] Optionally, in one possible implementation, the device further includes a verification module, specifically used for:

[0197] Based on the user's grasping command, a corresponding verification prompt is generated; the verification prompt is used to instruct the visual language model to perform a verification task on the robot's gripper image; the verification task includes determining whether there is an object in the gripper and whether the object in the gripper is the target object corresponding to the user's grasping command.

[0198] The robot's current gripper image is acquired, and the current gripper image and the verification prompt are input into the visual language model to obtain the verification result corresponding to the current gripper image.

[0199] When the verification result indicates that there is an object in the gripper in the current gripper image, and the object in the gripper is the target object corresponding to the user's grasping command, the robot is determined to have successfully grasped the object, and the robot is controlled to perform a return-to-zero action.

[0200] If the verification result indicates that there is no object in the gripper in the current gripper image, or if there is an object in the gripper in the current gripper image but the object in the gripper is not the target object corresponding to the user's grasping command, the robot is determined to have failed to grasp. A grasping failure diagnosis result is generated based on the current gripper image so that the grasping strategy of the previous round can be dynamically adjusted according to the diagnosis result.
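The verification flow of the verification module can be sketched as below. The visual language model is represented by a caller-supplied function, since no specific model or API is named in this application; `query_vlm`, `VerificationResult`, `GraspOutcome`, the prompt wording, and the stub model are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


@dataclass(frozen=True)
class VerificationResult:
    object_in_gripper: bool  # is anything held in the gripper?
    is_target: bool          # is the held object the commanded target?


class GraspOutcome(Enum):
    SUCCESS = "return_to_zero"      # grasp succeeded: perform the return-to-zero action
    FAILURE = "diagnose_and_retry"  # grasp failed: generate a failure diagnosis


def build_verification_prompt(command: str) -> str:
    """Turn the user's grasping command into a verification prompt (illustrative wording)."""
    return (
        f"The user asked: '{command}'. Looking at the gripper image, answer two questions: "
        "(1) is there an object in the gripper? (2) is it the object the user asked for?"
    )


def verify_grasp(command: str, gripper_image: bytes,
                 query_vlm: Callable[[bytes, str], VerificationResult]) -> GraspOutcome:
    """Apply the success rule: an object is held AND it is the target object."""
    result = query_vlm(gripper_image, build_verification_prompt(command))
    if result.object_in_gripper and result.is_target:
        return GraspOutcome.SUCCESS
    return GraspOutcome.FAILURE


def stub_vlm(image: bytes, prompt: str) -> VerificationResult:
    # Stand-in for a real model call: reports a held object that is not the target.
    return VerificationResult(object_in_gripper=True, is_target=False)


outcome = verify_grasp("pick up the red cup", b"", stub_vlm)
```

Both failure cases described above (an empty gripper, or a held object that is not the target) map to the same `FAILURE` outcome, after which the diagnosis step would run.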

[0201] It should be noted that the above modules can be functional modules or program modules, and can be implemented through software or hardware. For modules implemented through hardware, the above modules can reside in the same processor; or the above modules can be located in different processors in any combination.

[0202] This embodiment also provides a robot, including a memory and a processor, the memory storing a computer program, and the processor being configured to run the computer program to perform the steps in any of the above method embodiments.

[0203] Optionally, the robot may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor and the input/output device is connected to the processor.

[0204] It should be noted that the specific examples in this embodiment can refer to the examples described in the above embodiments and optional implementations, and will not be repeated in this embodiment.

[0205] Furthermore, in conjunction with the robot grasping control method provided in the above embodiments, this embodiment can also provide a storage medium for implementation. The storage medium stores a computer program; when executed by a processor, the computer program implements any of the robot grasping control methods described in the above embodiments.

[0206] It should be understood that the specific embodiments described herein are merely illustrative of the application and not intended to limit it. All other embodiments derived by those skilled in the art based on the embodiments provided in this application without inventive effort are within the scope of protection of this application.

[0207] Obviously, the accompanying drawings are merely some examples or embodiments of this application. Those skilled in the art can apply this application to other similar situations based on these drawings without any creative effort. Furthermore, it is understood that although the work done in this development process may be complex and lengthy, for those skilled in the art, certain design, manufacturing, or production modifications made based on the technical content disclosed in this application are merely conventional technical means and should not be considered as insufficient disclosure of this application.

[0208] The term "embodiment" in this application refers to a specific feature, structure, or characteristic described in connection with an embodiment that may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily imply the same embodiment, nor does it imply that it is mutually exclusive with or independent of other embodiments. It will be clearly or implicitly understood by those skilled in the art that the embodiments described in this application may be combined with other embodiments without conflict.

[0209] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of patent protection. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the appended claims.

Claims

1. A robot grasping control method, characterized in that the method includes: acquiring real-time environmental images, the environmental images including a two-dimensional environmental image and an environmental depth image; determining a target object region in the two-dimensional environmental image, and generating a target three-dimensional point cloud corresponding to the target object region based on the environmental depth image; determining multiple first candidate points in the target three-dimensional point cloud, and predicting a candidate grasping pose corresponding to each first candidate point; performing multimodal evaluation on each candidate grasping pose to obtain an evaluation result corresponding to each candidate grasping pose, and determining the optimal grasping pose among the candidate grasping poses based on the evaluation results corresponding to the candidate grasping poses; and controlling the robot to perform the corresponding grasping action based on the optimal grasping pose.

2. The robot grasping control method of claim 1, wherein determining multiple first candidate points in the target three-dimensional point cloud includes: performing feature extraction on the target three-dimensional point cloud to obtain high-dimensional features corresponding to each three-dimensional point in the target three-dimensional point cloud; determining multiple three-dimensional points in the target three-dimensional point cloud as second candidate points, and determining a target approach direction corresponding to each second candidate point; for each second candidate point, constructing, within the target three-dimensional point cloud, a neighborhood point set corresponding to the second candidate point, with the second candidate point as the center and the target approach direction corresponding to the second candidate point as the direction axis, and extracting the high-dimensional features of each point in the neighborhood point set to obtain a local orientation feature set; and for each second candidate point, performing grasp quality prediction on the local orientation feature set corresponding to the second candidate point to obtain a grasp quality prediction score for the second candidate point, and selecting a preset number of second candidate points as the first candidate points based on the grasp quality prediction scores of the second candidate points.

3. The robot grasping control method of claim 2, wherein performing feature extraction on the target three-dimensional point cloud to obtain high-dimensional features corresponding to each three-dimensional point in the target three-dimensional point cloud includes: performing sparse convolution on each three-dimensional point in the target three-dimensional point cloud and the neighboring points of each three-dimensional point to obtain local geometric features corresponding to each three-dimensional point; using an encoder-decoder structure to perform multi-scale aggregation of the local geometric features corresponding to each three-dimensional point to obtain multi-scale context features; and performing point-by-point feature mapping on the multi-scale context features to obtain a high-dimensional feature tensor aligned with the spatial resolution of the target three-dimensional point cloud, the high-dimensional feature tensor including the high-dimensional features corresponding to each three-dimensional point.

4. The robot grasping control method of claim 2, wherein determining multiple three-dimensional points in the target three-dimensional point cloud as second candidate points includes: evaluating grasp feasibility on the high-dimensional features corresponding to each three-dimensional point to obtain an evaluation score for each three-dimensional point, the evaluation score indicating how feasible it is to use the three-dimensional point as a grasp center; comparing the evaluation score corresponding to each three-dimensional point with a preset score threshold to determine the three-dimensional points whose evaluation scores are greater than the preset score threshold; and determining each three-dimensional point whose evaluation score is greater than the preset score threshold as a second candidate point.

5. The robot grasping control method of claim 2, wherein predicting the candidate grasping pose corresponding to each first candidate point includes: predicting, based on the local orientation feature set corresponding to each first candidate point, multiple grasping attributes corresponding to each first candidate point, the multiple grasping attributes including in-plane rotation angle, grasping depth, and gripper width; and obtaining the candidate grasping pose corresponding to each first candidate point by combining the first candidate point, the target approach direction corresponding to the first candidate point, and the multiple grasping attributes.

6. The robot grasping control method of claim 1, wherein determining the target object region in the two-dimensional environmental image includes: acquiring the user's grasping command in real time; and inputting the user's grasping command and the two-dimensional environmental image into a visual language model to obtain the target object region in the two-dimensional environmental image, wherein the visual language model is used to extract the deep target semantics corresponding to the user's grasping command and the image spatial features corresponding to the two-dimensional environmental image, the deep target semantics and the image spatial features are matched cross-modally, and the target object region in the two-dimensional environmental image is located based on the matching result.

7. The robot grasping control method of claim 6, wherein performing multimodal evaluation on each candidate grasping pose to obtain the evaluation result corresponding to each candidate grasping pose includes: constructing a virtual gripper corresponding to each candidate grasping pose, and determining the coordinates of multiple key points on each virtual gripper, the multiple key points including at least the two farthest endpoints on the gripper base along the gripper opening and closing direction, and the tip of each gripper finger; projecting, based on the camera intrinsic parameter matrix, the coordinates of the multiple key points on each virtual gripper onto the two-dimensional environmental image, and generating corresponding grasp wireframe markers based on the projection results to obtain the gripper projection image corresponding to each virtual gripper, the camera intrinsic parameter matrix being the intrinsic parameter matrix of the camera used to acquire the two-dimensional environmental image; determining, based on the candidate grasping pose corresponding to each virtual gripper, the 6D pose parameters corresponding to each virtual gripper; and for each virtual gripper, performing multimodal evaluation on the candidate grasping pose based on the gripper projection image corresponding to the virtual gripper, the 6D pose parameters corresponding to the virtual gripper, the preset joint limit threshold of the robot's robotic arm, and the user's grasping command, to obtain the evaluation result corresponding to the candidate grasping pose.

8. The robot grasping control method of claim 7, wherein performing multimodal evaluation on the candidate grasping pose based on the gripper projection image corresponding to the virtual gripper, the 6D pose parameters corresponding to the virtual gripper, the preset joint limit threshold of the robot's robotic arm, and the user's grasping command includes: constructing a first evaluation sub-item based on the gripper projection image corresponding to the virtual gripper; constructing a second evaluation sub-item based on the 6D pose parameters corresponding to the virtual gripper and the preset joint limit threshold of the robot's robotic arm; constructing a third evaluation sub-item based on the gripper projection image corresponding to the virtual gripper and the user's grasping command; constructing a multi-dimensional joint evaluation function based on the first evaluation sub-item, the second evaluation sub-item, and the third evaluation sub-item; and performing multimodal evaluation on the candidate grasping pose using the multi-dimensional joint evaluation function.

9. The robot grasping control method of any one of claims 6-8, wherein, after controlling the robot to perform the corresponding grasping action based on the optimal grasping pose, the method further includes: generating a corresponding verification prompt based on the user's grasping command, the verification prompt being used to instruct the visual language model to perform a verification task on the robot's gripper image, the verification task including determining whether there is an object in the gripper and whether the object in the gripper is the target object corresponding to the user's grasping command; acquiring the robot's current gripper image, and inputting the current gripper image and the verification prompt into the visual language model to obtain the verification result corresponding to the current gripper image; when the verification result indicates that there is an object in the gripper in the current gripper image and that the object in the gripper is the target object corresponding to the user's grasping command, determining that the robot has grasped successfully, and controlling the robot to perform a return-to-zero action; and when the verification result indicates that there is no object in the gripper in the current gripper image, or that there is an object in the gripper in the current gripper image but the object is not the target object corresponding to the user's grasping command, determining that the robot has failed to grasp, and generating a grasping failure diagnosis result based on the current gripper image, so that the grasping strategy of the previous round can be dynamically adjusted based on the diagnosis result.

10. A robot grasping control device, characterized in that the device includes: an acquisition module, used to acquire real-time environmental images, the environmental images including a two-dimensional environmental image and an environmental depth image; a generation module, used to determine a target object region in the two-dimensional environmental image and generate a target three-dimensional point cloud corresponding to the target object region based on the environmental depth image; a prediction module, used to determine multiple first candidate points in the target three-dimensional point cloud and predict a candidate grasping pose corresponding to each first candidate point; an evaluation module, used to perform multimodal evaluation on each candidate grasping pose, obtain the evaluation result corresponding to each candidate grasping pose, and determine the optimal grasping pose among the candidate grasping poses based on the evaluation results corresponding to the candidate grasping poses; and a control module, used to control the robot to perform the corresponding grasping action based on the optimal grasping pose.