Robot manipulation and grasping method, system, and storage medium based on VLA
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ANHUI UNIV
- Filing Date
- 2026-06-01
- Publication Date
- 2026-06-30
AI Technical Summary
Existing robotic grasping systems struggle to understand natural language commands, adapt poorly to complex scenes, grasp unstably, carry a high collision risk, and suffer from weak integration between system modules.
A VLA-based robot grasping method is adopted. A large vision-language-action (VLA) model parses the language command; environmental images acquired by depth cameras are used for target detection and segmentation; multiple candidate grasping poses are generated for the segmented target; collision detection and stability evaluation are performed on the candidates; and finally the optimal grasping pose is selected and the grasping operation is executed.
It enables robots to grasp autonomously, stably, and safely in unknown or complex scenarios, improves task semantic understanding and grasping generalization performance, reduces the risk of invalid grasping and collisions, and is applicable to various types of robotic arm platforms under the ROS control framework.
Figure CN122299682A_ABST
Description
Technical Field
[0001] This invention relates to the field of robot intelligent operation and embodied intelligence, specifically to a robot operation grasping method, system, and storage medium based on VLA.
Background Technology
[0002] As industrial and service robots expand into complex and unstructured environments, traditional rule-based and geometric modeling-based robotic arm grasping methods are gradually revealing problems such as poor generalization ability, reliance on manual modeling, and difficulty in understanding task semantics. Most existing robotic arm grasping systems require manual specification of the grasping target and pose, making it difficult to directly respond to natural language commands and achieve stable grasping in situations where the target is unknown or the scene is complex.
[0003] In recent years, the development of Large Language Models (LLMs) and Vision-Language Models (VLMs) has provided new technical pathways for robot manipulation. By introducing language understanding and visual semantic alignment capabilities, robots can acquire the ability to understand task objectives, operational intentions, and scene semantics. However, existing research mostly focuses on the perception or planning stages and lacks a unified operational framework that deeply integrates language understanding, visual segmentation, grasp detection, and physical execution. As a result, practical robot systems still suffer from insufficient grasping stability, high collision risk, and weak integration between system modules.
[0004] Therefore, there is an urgent need for a robot grasping method that can integrate language understanding, visual perception and action execution to achieve closed-loop control from natural language commands to real robot grasping actions.
Summary of the Invention
[0005] This invention aims to provide a robot grasping method and system based on VLA (Vision-Language-Action) to solve the technical problems of existing robot grasping systems, such as difficulty in understanding language commands, poor adaptability to complex scenarios, and insufficient grasping stability, so that a robot can grasp autonomously, stably, and safely in unknown environments based on language commands.
[0006] This invention solves the above-mentioned technical problems through the following technical solution: a robot manipulation and grasping method based on VLA, the method comprising the following steps:
[0007] S1. Language instruction parsing: Receive a natural language grasping instruction from the user, parse the instruction using the vision-language-action large model, extract the semantic description of the target object, the operational constraints, and the task intent, and generate a structured task representation.
[0008] S2. Visual perception and target segmentation: Environmental images are acquired by depth cameras installed at the end of the robotic arm and the robot body. The language prompts are input into the open vocabulary visual segmentation model to perform target detection and semantic segmentation on the environmental images and obtain the target object mask region corresponding to the language command.
[0009] S3. Target point cloud generation and grasp detection: Based on the semantic segmentation mask, extract the target object point cloud from the depth image, input the target point cloud into the grasp detection model, generate multiple candidate grasp poses, and calculate the grasp score for each candidate grasp pose.
[0010] S4. Grasping stability assessment and collision detection: Collision detection is performed on the generated candidate grasping poses to eliminate those that pose a risk of collision with the environment or the robot itself; at the same time, the stability of the remaining candidate grasping poses is scored based on the object's center of gravity distribution, and they are ranked according to the comprehensive score.
[0011] S5. Optimal grasping pose selection and execution: Select the grasping pose with the highest score, convert it to the robot's base coordinate system, generate the corresponding motion control command, and drive the robotic arm to perform the grasping operation through the Robot Operating System (ROS).
[0012] On the other hand, the present invention also provides a VLA-based robot manipulation and grasping system, comprising:
[0013] One or more processors;
[0014] Memory, used to store one or more programs;
[0015] When the one or more programs are executed by the one or more processors, the system performs the VLA-based robot grasping method as described above.
[0016] In another aspect, the present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the VLA-based robot grasping method as described above.
[0017] The positive and progressive effects of this invention are as follows:
[0018] 1. By introducing the semantic understanding and cross-modal alignment capabilities of the vision-language-action large model, end-to-end linkage of natural language commands, visual perception and robotic arm grasping actions is achieved. Even under unknown or complex scene conditions, target recognition and grasping planning can still be stably completed. This breaks through the dependence of traditional grasping methods on manual target setting and fixed rule strategies, and significantly improves the robot's ability to understand task semantics, grasping generalization performance and autonomous operation level.
[0019] 2. The system integrates open-vocabulary object detection and general segmentation models to generate high-precision semantic masks. Combined with point cloud-based dense grasping detection, collision detection, and grasping stability evaluation mechanisms, it effectively reduces the risk of invalid grasping and collisions in cluttered environments. Compared with traditional methods based on single visual or geometric features, the grasping success rate and execution stability are improved. Moreover, the system can be directly deployed on various types of robotic arm platforms under the ROS control framework, demonstrating good versatility and engineering feasibility.
Attached Figure Description
[0020] Figure 1 is a flowchart of the method of the present invention.
[0021] Figure 2 is a schematic diagram of a specific experimental scenario for the present invention.
[0022] Figure 3 is a structural block diagram of the local terminal of an exemplary electronic device of the present invention.
[0023] Figure 4 is a structural block diagram of the network terminal of an exemplary electronic device of the present invention.
Detailed Implementation
[0024] The present invention will be further illustrated by way of embodiments below, but the present invention is not limited to the scope of the embodiments.
[0025] Example 1:
[0026] Referring to Figures 1 to 4, a VLA-based robot grasping method comprises the following steps:
[0027] S1, Language Instruction Parsing;
[0028] S11. Preprocessing Language Instructions: Receive the natural language grasping instruction from the user and preprocess it, including removing redundant modifiers, standardizing expressions, and completing missing information, in order to eliminate ambiguity and ensure that the instruction is semantically clear.
[0029] S12. Semantic parsing: Input the preprocessed natural language instructions into the Vision-Language-Action (VLA) large model, parse the target object semantics, operational constraints and task intent in the instructions, and generate a structured task representation for subsequent visual recognition and grasping planning.
[0030] S2, Visual Perception and Target Segmentation;
[0031] S21. Image Acquisition: Acquire color and depth images of the working environment using depth cameras mounted at the end of the robotic arm and on the robot body;
[0032] S22, Object Detection: Input the language prompts in the structured task representation generated in S1 into the open-vocabulary visual segmentation model to perform object detection on the environmental image and obtain the target candidate regions corresponding to the language semantics;
[0033] S23. Target Segmentation: Based on the target candidate regions obtained in S22, a refined segmentation mask consistent with the natural language instruction is generated using a prompt-driven image segmentation model. Its internal processing specifically includes:
[0034] S231, Image Feature Extraction: The segmentation model encodes the input image $I$ using a hierarchical visual Transformer to generate multi-scale visual features:
[0035] $\{F_l\}_{l=1}^{L} = E_{\mathrm{img}}(I), \quad F_l \in \mathbb{R}^{H_l \times W_l \times C_l}$;
[0036] where $E_{\mathrm{img}}$ is the hierarchical vision encoder; $F_l$ is the image semantic feature of layer $l$; $H_l$, $W_l$, and $C_l$ are the height, width, and number of channels of the feature map of layer $l$; and $\{F_l\}_{l=1}^{L}$ denotes the feature maps of the different levels.
[0037] S232, Prompt Encoding: Based on the target candidate region $B$ obtained in S22, the segmentation prompt is input to the prompt encoder $E_{\mathrm{prompt}}$ of the segmentation model, which transforms $B$ into the prompt features $P$: $P = E_{\mathrm{prompt}}(B)$;
[0038] S233, Mask Prediction: The prompt features and image features are input into the attention fusion module of the segmentation model, whose core calculation is:
[0039] $F_{\mathrm{fuse}} = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$;
[0040] where $Q = W_q P$ is the query generated from the prompt features, with $W_q$ a learnable weight matrix and $P$ the prompt features; $K$ and $V$ are the keys and values generated from the image features and the model's internal memory; $\mathrm{softmax}(\cdot)$ is the normalized activation function; $d$ is the feature dimension; and $F_{\mathrm{fuse}}$ is the fused feature. A probability mask for the target region is then generated by the mask decoder:
[0041] $\hat{M} = \sigma\left(D_{\mathrm{mask}}(F_{\mathrm{fuse}})\right)$;
[0042] where $\hat{M}$ is the predicted candidate target mask to be screened, $\sigma(\cdot)$ is the sigmoid activation function, $D_{\mathrm{mask}}$ is the mask decoder, and $F_{\mathrm{fuse}}$ is the fused feature.
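The attention fusion and mask decoding of S233 can be illustrated with a minimal NumPy sketch. It assumes row-vector token layout (so the query is computed as `P @ W_q`), introduces hypothetical key and value projections `W_k` and `W_v` that the text does not name, omits the model's internal memory, and replaces the mask decoder with a stand-in linear map `W_dec`. It only shows the shape of the computation, not the segmentation model itself.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_prompt_with_image(P, F, W_q, W_k, W_v):
    """Scaled dot-product attention: prompt tokens query image tokens."""
    Q = P @ W_q                      # (n_prompt, d)
    K = F @ W_k                      # (n_pixels, d), hypothetical projection
    V = F @ W_v                      # (n_pixels, d), hypothetical projection
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
P = rng.normal(size=(2, 8))          # 2 prompt tokens, 8 channels
F = rng.normal(size=(16, 8))         # 16 image tokens (a flattened 4x4 map)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))

F_fuse, A = fuse_prompt_with_image(P, F, W_q, W_k, W_v)

# Stand-in "mask decoder": a linear map to one logit per image token.
W_dec = rng.normal(size=(4, 16))
mask_prob = sigmoid(F_fuse @ W_dec)  # (2, 16) candidate probability masks
```

Each row of `mask_prob` plays the role of one candidate probability mask over the 16 image tokens.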
[0043] S234, Optimal Mask Output: S233 generates multiple mask candidates $\hat{M}_k$ together with their quality scores $s_k$. Each quality score is calculated from the global representation $g_k$ of the corresponding mask:
[0044] $s_k = f(g_k; W_s)$
[0045] where $W_s$ are learnable weights. The candidate with the highest score is selected as the final output mask:
[0046] $M^{*} = \hat{M}_{k^{*}}, \quad k^{*} = \arg\max_{k} s_k$;
[0047] where $\{\hat{M}_k\}$ are the multiple candidate probability masks.
[0048] S3. Target point cloud generation and grasp detection;
[0049] S31. Target point cloud generation: Based on the semantic segmentation mask $M^{*}$ of the target object obtained in S2, first extract the set of pixels inside the mask from the depth image:
[0050] $\Omega = \{(u, v) \mid M^{*}(u, v) = 1\}$;
[0051] Then, using the depth values $D(u, v)$ and the camera intrinsic parameter matrix:
[0052] $K_c = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$;
[0053] the pixels are back-projected into three-dimensional space. The three-dimensional coordinates of each target point are calculated as follows:
[0054] $Z = D(u, v), \quad X = \frac{(u - c_x)\, Z}{f_x}, \quad Y = \frac{(v - c_y)\, Z}{f_y}$;
[0055] where $f_x$ and $f_y$ are the focal lengths of the camera along the x-axis and y-axis; $(c_x, c_y)$ are the principal point coordinates of the camera, i.e., the projection point of the optical axis onto the imaging plane; and $X$, $Y$, $Z$ are the three-dimensional spatial coordinates corresponding to pixel $(u, v)$;
[0056] Thus, the three-dimensional point cloud set of the target object can be obtained:
[0057] $\mathcal{P}_{\mathrm{obj}} = \{(X, Y, Z) \mid (u, v) \in \Omega\}$;
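The back-projection in S31 follows the standard pinhole camera model and can be sketched in a few lines of NumPy. The function name `mask_to_point_cloud` is hypothetical, and discarding pixels with zero depth is an added assumption (depth cameras commonly report 0 for missing readings); the text itself does not describe how invalid depth is handled.

```python
import numpy as np

def mask_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked pixels with valid depth into camera-frame 3D points.

    depth: (H, W) array of depth values in metres; mask: (H, W) boolean array.
    Returns an (N, 3) array of [X, Y, Z] points.
    """
    # np.nonzero returns (row, col) indices: rows are v, columns are u.
    v, u = np.nonzero(mask & (depth > 0))   # ASSUMPTION: depth == 0 means "missing"
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

depth = np.full((4, 4), 2.0)
depth[0, 0] = 0.0                           # simulated missing depth reading
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = mask[1, 2] = mask[2, 2] = True

cloud = mask_to_point_cloud(depth, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
# Two points survive: pixel (u=2, v=1) -> [0, -1, 2] and pixel (u=2, v=2) -> [0, 0, 2]
```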
[0058] S32. Feature Extraction and Spatial Analysis: The target object point cloud $\mathcal{P}_{\mathrm{obj}}$ is input into a point cloud-based grasp detection model, which completes point cloud structure understanding and grasping feasibility modeling in two stages, feature extraction and spatial analysis, specifically including:
[0059] S321. Point Cloud Geometric Feature Extraction: Based on the target 3D point cloud generated in S31, the grasp detection model encodes the local curvature, surface normals, geometric structure, and overall spatial relationships of the point cloud to form a multi-scale semantic feature representation of the point cloud:
[0060] $F_{\mathrm{pc}} = E_{\mathrm{geo}}(\mathcal{P}_{\mathrm{obj}})$;
[0061] where $E_{\mathrm{geo}}$ is the point cloud geometric feature encoder of the grasp detection model, and $F_{\mathrm{pc}}$ is the high-dimensional feature set of the point cloud, containing local and global spatial semantic information.
[0062] S322, Spatial Analysis: Based on the point cloud features $F_{\mathrm{pc}}$ from S321, the grasp detection model performs grasp-related spatial geometric analysis on the point cloud to predict the feasibility of contact with the object surface and the grasping depth. The grasping depth relation is as follows:
[0063] $z_g = D(u, v) + d_g$
[0064] where $z_g$ is the predicted depth of the grasp center point in the camera coordinate system; $D(u, v)$ is the depth of the corresponding pixel in the point cloud; $(u, v)$ are the pixel coordinates on the image plane, where $u$ is the horizontal direction and $v$ is the vertical direction; and $d_g$ is the depth of the grasping pose relative to the object's surface.
[0065] S33, Candidate Grasping Pose Generation: Based on the point cloud features $F_{\mathrm{pc}}$ formed in S321, the grasp detection model generates an internal parameter set for each candidate, represented as follows: $g_i = (c_i, n_i, w_i, \theta_i)$;
[0066] where $c_i$ is the grasp center point position; $n_i$ is the grasp normal direction; $w_i$ is the width of the gripper opening; and $\theta_i$ is the rotation angle around the normal. From these parameters the 7-DoF grasping pose representation $G_i$ is obtained:
[0067] $G_i = (R_i, t_i, w_i), \quad R_i \in SO(3), \; t_i \in \mathbb{R}^{3}$;
[0068] where $R_i$ is the corresponding 3×3 rotation matrix, $t_i$ is the 3D position of the grasp center in the camera coordinate system, and $w_i$ is the minimum opening width of the gripper. The entire candidate set is denoted as:
[0069] $\mathcal{G} = \{G_i\}_{i=1}^{N}$;
[0070] Furthermore, while generating the internal parameter set, the grasp detection model also generates a grasp score (grasp_score) for each candidate grasping pose. This score quantifies the feasibility and success probability of grasping with that pose. It is regressed directly by the network from the geometric structure and local contact characteristics of the target point cloud, and its values are normalized.
[0071] S4. Grasping stability assessment and collision detection;
[0072] S41. Collision Detection: To make collision detection efficient, this step simplifies the mechanical gripper model. Specifically, the gripper is abstracted as three regular cuboids, corresponding to the fingertip regions of the two fingers and the fixed middle part of the gripper, respectively. This simplified model reduces computational complexity while preserving the key geometric relationships of grasping.
[0073] For each grasping pose $G_i$ obtained in S33, the simplified gripper model (three regular cuboids) is geometrically transformed according to the pose parameters to obtain the region actually occupied by the gripper in three-dimensional space during grasping. A geometric intersection test is then performed against the environmental point cloud $\mathcal{P}_{\mathrm{env}}$ (constructed from the depth cameras and including the scene background and the robot body).
[0074] If the region occupied by the gripper at grasping pose $G_i$ overlaps with or contacts any point in the environmental point cloud, that pose is judged to carry a collision risk. This can be expressed as:
[0075] $V(G_i) \cap \mathcal{P}_{\mathrm{env}} \neq \emptyset$;
[0076] where $V(G_i)$ is the volume occupied by the three-cuboid gripper model at pose $G_i$. Grasping poses that satisfy this condition are eliminated, ensuring that subsequent grasping takes place only within safe space.
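One way to realize the intersection test of S41 is to move the environment points into the gripper frame and check them against axis-aligned boxes there, instead of transforming the boxes. The sketch below takes that approach. The box centers and half-extents in `GRIPPER_BOXES` are made-up illustrative dimensions, not values from the source, and the gripper frame convention (z along the approach direction, y along the closing direction) is an assumption.

```python
import numpy as np

# HYPOTHETICAL gripper geometry in the gripper frame: (center, half-extents), metres.
GRIPPER_BOXES = [
    (np.array([0.0,  0.045,  0.02]), np.array([0.01, 0.005, 0.02])),  # left fingertip
    (np.array([0.0, -0.045,  0.02]), np.array([0.01, 0.005, 0.02])),  # right fingertip
    (np.array([0.0,  0.0,   -0.02]), np.array([0.01, 0.05,  0.02])),  # fixed middle part
]

def in_collision(R, t, env_points, boxes=GRIPPER_BOXES):
    """Return True if any environment point lies inside any gripper box at pose (R, t).

    R: (3, 3) rotation and t: (3,) translation of the grasp pose in the camera frame.
    env_points: (N, 3) environment point cloud in the same camera frame.
    """
    # Row-wise form of R^T (p - t): express every point in the gripper frame.
    local = (env_points - t) @ R
    for center, half in boxes:
        inside = np.all(np.abs(local - center) <= half, axis=1)
        if inside.any():
            return True
    return False

R = np.eye(3)
t = np.array([0.0, 0.0, 0.5])

blocked = in_collision(R, t, np.array([[0.0, 0.045, 0.52]]))  # point inside left fingertip box
free = in_collision(R, t, np.array([[0.0, 0.0, 0.52]]))       # point between the fingers
```

A point-in-box test is cheap enough to run for every candidate pose; candidates for which `in_collision` returns `True` would be dropped before scoring.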
[0077] S42. Grasping Stability Assessment: The grasp detection model uses a stability score as a quantitative metric of how stable the grasping pose is relative to the object's center of gravity (COG). Details are as follows:
[0078] S421. Assuming the target object is a rigid body with uniform density, estimate the object's centroid from the target point cloud $\mathcal{P}_{\mathrm{obj}} = \{p_j\}_{j=1}^{N_p}$ obtained in S31:
[0079] $c_{\mathrm{obj}} = \frac{1}{N_p} \sum_{j=1}^{N_p} p_j$;
[0080] S422. Transform the object centroid from S421 into the gripper coordinate system, which is defined by the grasping pose $G_i$:
[0081] $c^{g} = R_i^{\top} (c_{\mathrm{obj}} - t_i)$, where $t_i$ is the translation vector of grasping pose $G_i$ and $R_i^{\top}$ is the transpose of the rotation matrix of grasping pose $G_i$;
[0082] S423. Calculate the perpendicular distance from the centroid to the gripper plane, and normalize it to obtain the stability score:
[0083] $\mathrm{stable\_score} = \mathrm{norm}\left(\lvert c^{g}_{z} \rvert\right)$;
[0084] where $c^{g}_{z}$ is the z-axis coordinate component of the centroid in the gripper coordinate system. A smaller stable_score indicates that the grasping pose is closer to the COG and the grasp is more stable.
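Steps S421 to S423 can be sketched as follows. The text does not state how the distance is normalized, so this example assumes a simple choice: divide by a hypothetical maximum distance `max_dist` and clip to 1. The uniform-density assumption from S421 is what justifies using the point cloud mean as the centroid.

```python
import numpy as np

def stable_score(cloud, R, t, max_dist=0.1):
    """Normalized distance of the object centroid from the gripper plane (lower is better).

    cloud: (N, 3) target point cloud in the camera frame.
    R, t: rotation (3, 3) and translation (3,) of the grasp pose in the camera frame.
    max_dist: ASSUMED normalization constant in metres (not given in the source).
    """
    centroid = cloud.mean(axis=0)          # S421: centroid under uniform density
    c_g = R.T @ (centroid - t)             # S422: centroid in the gripper frame
    return min(abs(float(c_g[2])) / max_dist, 1.0)   # S423: normalized |z| offset

cloud = np.array([
    [0.0, 0.0, 0.5],
    [0.1, 0.0, 0.5],
    [0.0, 0.1, 0.5],
    [0.1, 0.1, 0.5],
])

# A grasp 2 cm away from the centroid along the gripper z-axis, and one passing through it.
offset_grasp = stable_score(cloud, np.eye(3), np.array([0.05, 0.05, 0.48]))
centered_grasp = stable_score(cloud, np.eye(3), np.array([0.05, 0.05, 0.50]))
```

With these inputs the centered grasp scores 0 and the offset grasp scores about 0.2, so the centered grasp ranks as more stable.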
[0085] S43. Comprehensive Ranking and Final Output: For the candidate poses that remain after the collision screening in S41, the model combines the predicted grasp_score and stable_score during inference to calculate a final comprehensive score and rank the candidates. The final score calculation formula is as follows:
[0086] ;
[0087] S5. Optimal grasping pose selection and execution;
[0088] S51. Optimal Grasping Pose Selection: Based on the candidate grasping pose set obtained in S43 and the corresponding comprehensive scores $S_i$, the system selects the grasping pose with the highest comprehensive score as the final grasping pose used for execution, denoted as:
[0089] $G^{*} = \arg\max_{G_i} S_i$
[0090] The optimal grasping pose $G^{*}$ combines the network-predicted grasping feasibility (grasp_score), the grasping stability under the constraint of the object's center of mass (stable_score), and the operational safety ensured by explicit collision detection, which helps ensure the success rate and reliability of the robotic arm during execution.
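The filter, score, and select sequence of S41 to S51 can be sketched as below. The formula that combines grasp_score and stable_score is not reproduced in this text, so `final_score` uses a purely illustrative rule (grasp_score minus a weighted stable_score, with a hypothetical weight `alpha`); it is consistent with "higher grasp_score is better, lower stable_score is better" but should not be read as the patent's formula.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    name: str
    grasp_score: float    # network-predicted feasibility, higher is better
    stable_score: float   # normalized centroid offset, lower is better
    collides: bool        # result of the collision screening

def final_score(c: Candidate, alpha: float = 0.5) -> float:
    # ASSUMPTION: illustrative combination only; the source formula is not available here.
    return c.grasp_score - alpha * c.stable_score

def select_best(candidates: List[Candidate], alpha: float = 0.5) -> Optional[Candidate]:
    """Drop colliding candidates, then return the one with the highest final score."""
    safe = [c for c in candidates if not c.collides]
    if not safe:
        return None
    return max(safe, key=lambda c: final_score(c, alpha))

candidates = [
    Candidate("a", grasp_score=0.90, stable_score=0.80, collides=False),  # 0.90 - 0.40 = 0.50
    Candidate("b", grasp_score=0.80, stable_score=0.10, collides=False),  # 0.80 - 0.05 = 0.75
    Candidate("c", grasp_score=0.95, stable_score=0.05, collides=True),   # rejected by screening
]
best = select_best(candidates)
```

Candidate "c" has the best raw scores but is removed by the collision screening, so "b" is selected.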
[0091] S52, Pose Coordinate System Transformation and Motion Control Command Generation: Transform the optimal grasping pose $G^{*}$, including its rotation matrix, from the camera coordinate system to the robot base coordinate system to form a grasping reference pose that the robotic arm can execute. Let the homogeneous transformation matrix from the camera coordinate system to the robot base coordinate system be:
[0092] $T^{\mathrm{base}}_{\mathrm{cam}} = \begin{bmatrix} R^{\mathrm{base}}_{\mathrm{cam}} & t^{\mathrm{base}}_{\mathrm{cam}} \\ 0^{\top} & 1 \end{bmatrix}$;
[0093] where $R^{\mathrm{base}}_{\mathrm{cam}}$ is the rotation matrix from the camera to the base and $t^{\mathrm{base}}_{\mathrm{cam}}$ is the translation vector from the camera to the base. Then, the optimal grasping pose in the robot base coordinate system is represented as:
[0094] $T^{*}_{\mathrm{base}} = T^{\mathrm{base}}_{\mathrm{cam}} \, T^{*}_{\mathrm{cam}}$, where $T^{*}_{\mathrm{cam}}$ is the optimal grasping pose in the camera coordinate system;
[0095] where
[0096] $T^{*}_{\mathrm{cam}} = \begin{bmatrix} R^{*} & t^{*} \\ 0^{\top} & 1 \end{bmatrix}$
[0097] with $R^{*}$ and $t^{*}$ the rotation matrix and translation vector of $G^{*}$. The pose $T^{*}_{\mathrm{base}}$ can be used directly in the motion planning module. Subsequently, according to the requirements of the robotic arm control interface, the pose is converted into control commands that the robot can execute, including the target pose, the gripper opening and closing parameters, and the approach trajectory parameters, which are used to drive the robotic arm to perform the grasping action.
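The camera-to-base conversion in S52 is a single product of homogeneous transforms. The NumPy sketch below uses hypothetical helper names (`make_transform`, `grasp_to_base`) and a made-up extrinsic calibration: a camera mounted 1 m above the base origin and looking straight down. In a real system the extrinsics would come from hand-eye calibration.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a rotation matrix and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def grasp_to_base(T_base_cam, R_cam, t_cam):
    """Express a grasp pose given in the camera frame in the robot base frame."""
    T_base_grasp = T_base_cam @ make_transform(R_cam, t_cam)
    return T_base_grasp[:3, :3], T_base_grasp[:3, 3]

# ASSUMED extrinsics: camera rotated 180 degrees about x, 1 m above the base origin.
R_base_cam = np.diag([1.0, -1.0, -1.0])
T_base_cam = make_transform(R_base_cam, np.array([0.0, 0.0, 1.0]))

# Grasp 0.6 m in front of the camera, with the camera's own orientation.
R_base, t_base = grasp_to_base(T_base_cam, np.eye(3), np.array([0.1, 0.2, 0.6]))
# t_base is [0.1, -0.2, 0.4]: the grasp sits 0.4 m above the base plane.
```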
[0098] S53, Grasping Execution: The corresponding motion control nodes (including trajectory planning nodes, motion controller nodes, and gripper control nodes) are activated through the ROS robot operating system to send the control commands generated in S52 to the robotic arm. This achieves VLA-based robot grasping.
[0099] Example 2:
[0100] This invention can also be implemented using the following technical solutions, including:
[0101] S1. Language instruction parsing and semantic / geometric constraint generation:
[0102] The system receives natural language grasping instructions from the user, parses the instructions using a large vision-language-action model, extracts the target object's category description, color description, shape description, spatial relationship description, target part description, grasping action intent, prohibited contact areas, and operation safety constraints, and generates a structured task representation.
[0103] The structured task representation includes target semantic field, target part field, grasping method field, posture constraint field, obstacle avoidance constraint field, contact force constraint field, and task priority field.
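The seven fields listed above map naturally onto a small record type. The following dataclass is a sketch only: the field names, types, defaults, and the `language_prompt` helper are assumptions made for illustration, since the source names the fields but not their encoding.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class TaskRepresentation:
    """Hypothetical container for the structured task representation."""
    target_semantic: str                                          # target semantic field
    target_part: Optional[str] = None                             # target part field
    grasp_method: str = "parallel_jaw"                            # grasping method field
    posture_constraints: List[str] = field(default_factory=list)  # posture constraint field
    obstacle_avoidance: List[str] = field(default_factory=list)   # obstacle avoidance field
    max_contact_force_n: Optional[float] = None                   # contact force constraint field
    priority: int = 0                                             # task priority field

    def language_prompt(self) -> str:
        """Build a text prompt for an open-vocabulary detector from the target fields."""
        if self.target_part:
            return f"{self.target_part} of {self.target_semantic}"
        return self.target_semantic

# Possible parse of: "pick up the red mug by the handle, but avoid the rim"
task = TaskRepresentation(
    target_semantic="red mug",
    target_part="handle",
    obstacle_avoidance=["rim"],
)
prompt = task.language_prompt()
record = asdict(task)   # plain dict, convenient for logging or message passing
```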
[0104] Furthermore, when the natural language grasping instruction is ambiguous, the vision-language-action large model generates a list of candidate targets based on the candidate objects in the environmental image and calculates the semantic matching degree between each candidate target and the language instruction. When the highest semantic matching degree is lower than a preset threshold, or the difference between the highest and the second-highest semantic matching degree is lower than a preset difference threshold, a secondary visual confirmation or re-parsing mechanism is triggered to avoid grasping the wrong object because the target was misunderstood.
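The two trigger conditions for secondary confirmation reduce to a short check on the sorted matching degrees. In this sketch the threshold values `min_score` and `min_margin` are placeholders, and treating an empty candidate list as "needs confirmation" is an added assumption.

```python
from typing import Dict

def needs_confirmation(match_scores: Dict[str, float],
                       min_score: float = 0.6,
                       min_margin: float = 0.15) -> bool:
    """Return True if secondary visual confirmation or re-parsing should be triggered.

    match_scores maps each candidate target to its semantic matching degree.
    min_score and min_margin are HYPOTHETICAL preset thresholds.
    """
    if not match_scores:
        return True                      # ASSUMPTION: no candidates means re-parse
    ranked = sorted(match_scores.values(), reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0
    return best < min_score or (best - second) < min_margin

clear_case = needs_confirmation({"red cup": 0.90, "red bowl": 0.30})     # False
ambiguous_case = needs_confirmation({"red cup": 0.90, "red mug": 0.85})  # True: margin too small
weak_case = needs_confirmation({"red cup": 0.40, "red bowl": 0.10})      # True: best score too low
```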
[0105] S2. Multi-view visual perception and open-vocabulary target segmentation:
[0106] RGB-D images of the environment are acquired by a first depth camera mounted at the end of the robotic arm and a second depth camera mounted on the robot body. Based on the robot's hand-eye calibration parameters, the depth images from different perspectives are unified to the robot's base coordinate system. The target semantic field and target part field in the structured task representation are converted into language prompts and input into an open vocabulary visual detection model and a general segmentation model to perform target detection, part localization and semantic segmentation on the environmental images, thereby obtaining the target object mask region corresponding to the language command.
[0107] Furthermore, consistency verification is performed on the target object masks obtained from multiple viewpoints, and a mask confidence map is generated based on the reprojection error of the mask boundary, the segmentation confidence, and the depth continuity under different viewpoints. The mask confidence map indicates how confidently each pixel in the target region belongs to the target object, and serves as an input parameter for subsequent point cloud extraction and grasp scoring.
[0108] S3. Target Point Cloud Generation and Uncertainty Modeling:
[0109] Based on the target object mask region and mask confidence map, the target object point cloud is extracted from the depth image, and outlier removal, normal vector estimation, voxel downsampling and multi-view point cloud fusion are performed on the target object point cloud to obtain the target object fused point cloud.
[0110] In the above steps, each point cloud point is associated with the segmentation confidence, depth measurement noise and multi-view reprojection error of the corresponding pixel to generate the uncertainty weight of the target point cloud. For mask boundary regions, depth missing regions and occluded regions, the weight of the corresponding point cloud points in the grasping detection process is reduced, thereby reducing the impact of incorrect segmentation, background point mixing or missing occluded points on grasping pose generation.
[0111] S4. Candidate grasping pose generation and language constraint filtering:
[0112] The fused point cloud of the target object is input into the grasping detection model to generate multiple candidate grasping poses; each candidate grasping pose includes the gripper center position, gripper approach direction, gripper opening and closing direction, gripper width, grasping depth, and grasping confidence.
[0113] Furthermore, based on the target part field, prohibited contact area, and posture constraint field in the structured task representation, the candidate grasping poses are filtered for semantic-geometric consistency. When the user instruction includes restrictions such as "grab the handle", "avoid the blade", "grab the middle of the bottle", and "do not press the button", the corresponding part or area is converted into a permitted contact area or a prohibited contact area in three-dimensional space, and candidate grasping poses that fall into the prohibited contact area, whose gripper direction does not meet the posture constraint, or whose gripper width does not meet the local geometric dimensions of the object are eliminated.
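A simplified version of this semantic-geometric filtering is sketched below. It assumes that prohibited contact areas have already been converted to axis-aligned boxes in 3D and that each grasp is summarized by its center and opening width; the posture (gripper direction) constraint is left out for brevity. All type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Box:
    """Axis-aligned region in 3D, e.g. a prohibited contact area."""
    lo: Vec3
    hi: Vec3

    def contains(self, p: Vec3) -> bool:
        return all(l <= x <= h for l, x, h in zip(self.lo, p, self.hi))

@dataclass
class Grasp:
    center: Vec3
    width: float   # gripper opening width, metres

def filter_by_constraints(grasps: List[Grasp], forbidden: List[Box],
                          local_width: float, max_opening: float) -> List[Grasp]:
    """Keep grasps outside every forbidden box whose width fits the local geometry."""
    kept = []
    for g in grasps:
        if any(b.contains(g.center) for b in forbidden):
            continue                      # falls into a prohibited contact area
        if not (local_width <= g.width <= max_opening):
            continue                      # opening does not fit the object locally
        kept.append(g)
    return kept

# "Grab the handle, avoid the blade": the blade occupies x in [0.10, 0.30].
blade = Box(lo=(0.10, -0.02, -0.01), hi=(0.30, 0.02, 0.01))
grasps = [
    Grasp(center=(0.05, 0.0, 0.0), width=0.04),   # on the handle, fits
    Grasp(center=(0.20, 0.0, 0.0), width=0.04),   # on the blade, rejected
    Grasp(center=(0.05, 0.0, 0.0), width=0.01),   # too narrow for a 3 cm handle
]
kept = filter_by_constraints(grasps, [blade], local_width=0.03, max_opening=0.08)
```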
[0114] S5. Collision detection, reachability detection, and center of gravity stability assessment:
[0115] Collision detection is performed on candidate grasping poses to determine whether the gripper, robotic arm link, and target object collide with environmental obstacles, tabletops, adjacent objects, or the robot itself in the grasping path; candidate grasping poses with collision risk are eliminated.
[0116] The remaining candidate grasping poses are tested for reachability: the robotic arm kinematic model is used to determine whether the robotic arm can reach the corresponding grasping pose under the constraints of joint angle limits, velocity limits, and singular configurations.
[0117] Furthermore, the geometric center, support region, and approximate centroid distribution of the target object are estimated based on the fused point cloud. When the material properties or mass distribution of the target object are unknown, the centroid region is estimated based on the point cloud voxel density, shape symmetry, and the object category in the user command. The offset between the gripper contact center of the candidate grasping pose and the projection of the target object's centroid is calculated, and a grasping stability score is generated by combining the gripper contact normal, friction cone constraint, and force closure index.
[0118] S6. Comprehensive scoring ranking and active re-observation mechanism:
[0119] For each remaining candidate grasping pose, a comprehensive score is calculated. The comprehensive score includes at least the grasping confidence output by the grasping detection model, the target semantic matching degree, the mask confidence, the point cloud uncertainty, the collision margin, the robotic arm reachability, the center of gravity stability, and the task constraint matching degree.
[0120] The following scoring methods may be used:
[0121] $S_i = w_1 C^{\mathrm{grasp}}_i + w_2 C^{\mathrm{sem}}_i + w_3 C^{\mathrm{col}}_i + w_4 C^{\mathrm{reach}}_i + w_5 C^{\mathrm{stab}}_i + w_6 C^{\mathrm{task}}_i - w_7 U^{\mathrm{pc}}_i$
[0122] where $S_i$ is the comprehensive score of the $i$-th candidate grasping pose; $C^{\mathrm{grasp}}_i$ is the grasp confidence output by the grasp detection model; $C^{\mathrm{sem}}_i$ is the target semantic and mask confidence; $C^{\mathrm{col}}_i$ is the collision margin score; $C^{\mathrm{reach}}_i$ is the reachability score of the robotic arm; $C^{\mathrm{stab}}_i$ is the center of gravity stability score; $C^{\mathrm{task}}_i$ is the task constraint matching degree; $U^{\mathrm{pc}}_i$ is the point cloud uncertainty score; and $w_1$ to $w_7$ are the weighting coefficients.
[0123] When the highest comprehensive score is lower than the preset execution threshold, or the difference in comprehensive scores between the top two candidate grasping poses is less than the preset safety difference, the end-effector camera of the robotic arm is controlled to move to the observation pose with the greatest information gain, the RGB-D image of the target object is re-acquired, and the steps of target segmentation, point cloud fusion and candidate grasping pose generation are repeated to improve the grasping reliability in occluded or cluttered scenes.
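The weighted scoring and the re-observation trigger of S6 can be sketched together. The weight values, the two thresholds, and the choice to subtract the uncertainty term as a penalty are all assumptions made for illustration; the text only states that the score combines these quantities with weighting coefficients.

```python
from typing import Dict, List

# HYPOTHETICAL weighting coefficients, one per score term.
WEIGHTS: Dict[str, float] = {
    "grasp": 0.30, "semantic": 0.15, "collision": 0.15, "reach": 0.10,
    "stability": 0.15, "task": 0.10, "uncertainty": 0.05,
}

def comprehensive_score(terms: Dict[str, float],
                        weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the score terms; uncertainty is ASSUMED to act as a penalty."""
    positive = sum(weights[k] * terms[k] for k in weights if k != "uncertainty")
    return positive - weights["uncertainty"] * terms["uncertainty"]

def should_reobserve(scores: List[float],
                     exec_threshold: float = 0.6,
                     safety_margin: float = 0.05) -> bool:
    """True if the best score is too low or the top two candidates are too close."""
    if not scores:
        return True
    ranked = sorted(scores, reverse=True)
    if ranked[0] < exec_threshold:
        return True
    return len(ranked) > 1 and (ranked[0] - ranked[1]) < safety_margin

ideal = comprehensive_score({
    "grasp": 1.0, "semantic": 1.0, "collision": 1.0, "reach": 1.0,
    "stability": 1.0, "task": 1.0, "uncertainty": 0.0,
})                                                # sum of the six positive weights

confident = should_reobserve([0.90, 0.50])        # False: clear winner above threshold
too_close = should_reobserve([0.90, 0.88])        # True: top two within the safety margin
too_low = should_reobserve([0.40])                # True: best score below threshold
```

When `should_reobserve` returns `True`, the pipeline would move the end-effector camera and repeat segmentation, fusion, and candidate generation before scoring again.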
[0124] S7. Optimal grasping pose selection, closed-loop correction and execution:
[0125] The candidate grasping pose with the highest comprehensive score is selected as the optimal grasping pose. It is then transformed from the camera coordinate system to the robot base coordinate system, and the pre-grasping pose, approach path, closing gripper action, and extraction path are generated based on the robot arm kinematic model.
[0126] After the robotic arm moves to the pre-grasping pose, it re-acquires local images of the target object through the end-effector depth camera and performs secondary correction on the position and orientation of the target object. When the target object is displaced or the deviation between the candidate grasping pose and the current target point cloud exceeds a preset threshold, local replanning is performed. When the deviation is within the allowable range, the gripper is controlled to move along the approach direction of the optimal grasping pose and close the gripper to complete the grasping operation.
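The decision made at the pre-grasp pose can be reduced to comparing a deviation against a tolerance. The sketch below simplifies the deviation to the Euclidean distance between the planned and the re-observed target position; the function name and the 1 cm default tolerance are hypothetical, and a fuller version would also compare orientation.

```python
import math
from typing import Sequence, Tuple

def correction_action(planned_center: Sequence[float],
                      observed_center: Sequence[float],
                      tolerance: float = 0.01) -> Tuple[str, float]:
    """Decide between executing the grasp and local replanning.

    planned_center: target position assumed when the grasp was planned (metres).
    observed_center: target position re-estimated from the end-effector camera.
    tolerance: ASSUMED allowable deviation in metres.
    """
    deviation = math.dist(planned_center, observed_center)
    action = "replan" if deviation > tolerance else "execute"
    return action, deviation

small_shift = correction_action((0.0, 0.0, 0.0), (0.003, 0.004, 0.0))  # 5 mm: execute
large_shift = correction_action((0.0, 0.0, 0.0), (0.03, 0.04, 0.0))    # 5 cm: replan
```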
[0127] In this embodiment, further, it may also include comprehensive scoring data processing of candidate grasping poses;
[0128] The comprehensive score of the candidate grasp pose is calculated based on the grasp confidence, language semantic matching degree, mask confidence, point cloud confidence, collision margin score, robotic arm reachability score, grasp stability score, task constraint matching degree, and comprehensive uncertainty score output by the grasp detection model.
[0129] In the above steps, the system establishes a corresponding candidate pose data item for each candidate grasping pose. The candidate pose data item includes at least the candidate pose number, the coordinates of the gripper center point, the gripper approach direction, the gripper opening and closing direction, the gripper width, the grasping depth, the grasping detection confidence, the corresponding target object number, the corresponding semantic mask number, the contact point cloud set, the collision detection result, the inverse kinematics solution result of the robotic arm, the stability evaluation result, and the task constraint matching result.
[0130] For the grasp confidence score output by the grasp detection model, the system first determines whether the candidate grasp pose belongs to a valid grasp pose directly generated by the grasp detection model. If the gripper opening width, grasping depth, or approach direction of the candidate grasp pose does not meet the gripper structure constraints, the candidate grasp pose is marked as an invalid pose and will not be included in the subsequent sorting; if the gripper structure constraints are met, its model output confidence score is retained and used as basic grasp feasibility data.
[0131] For semantic matching, the system matches the target object mask corresponding to the candidate grasping pose with the target semantic fields obtained from parsing the natural language command. If the category, color, shape, text identifier, functional attributes, or spatial relationship of the target object is consistent with the user command, the semantic matching level of the candidate grasping pose is increased; if the target object only meets some semantic conditions, its semantic matching level is decreased; if the target object is inconsistent with the target description in the user command, the corresponding candidate grasping pose is removed from the candidate set.
[0132] For mask confidence, the system reads pixel confidence, boundary stability, multi-view reprojection consistency, and depth continuity data within the mask region corresponding to the candidate grasping pose. If the gripper contact area of the candidate grasping pose is located in the center region of the mask, and this region is identified as the same target object in multiple views, its mask reliability level is increased; if the gripper contact area is located at the mask boundary, occlusion edge, or semantically unstable region, its mask reliability level is decreased.
[0133] For point cloud confidence, the system extracts the gripper contact point and its neighborhood point cloud corresponding to the candidate grasping pose, and reads the depth validity, normal vector stability, voxel density, multi-view fusion consistency, and occlusion risk label for each point cloud point. If the point cloud in the contact area is continuous, the local surface normal is stable, and the depth data is complete, the candidate grasping pose is marked as a high point cloud confidence pose; if there are depth holes, outliers, sparse points, or low-confidence points inferred from occlusion in the contact area, the point cloud confidence level of the candidate grasping pose is reduced.
[0134] For collision margin scoring, the system reads the results of static collision detection of the gripper, collision detection of the gripper closing path, collision detection of the robotic arm links, and collision detection of the carried target object. If the candidate grasping pose does not interfere with obstacles, tabletops, adjacent objects, or the robot body during pre-grasp, approach, gripping, lifting, and retraction, and maintains a safe clearance from obstacles, the collision safety level of the candidate grasping pose is increased; if the candidate grasping path approaches obstacles, passes through unknown space, or has a slight risk of interference, its collision safety level is decreased; if a clear collision is detected, the candidate grasping pose is eliminated.
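The three outcomes of collision margin scoring can be sketched from per-stage clearance values. The clearance representation (minimum distance to obstacles per stage, with a non-positive value meaning interference), the 2 cm safe clearance, and the stage names are assumptions for illustration.

```python
from enum import Enum


class Safety(Enum):
    ELIMINATED = 0   # clear collision detected
    LOWERED = 1      # close to obstacles or passing through unknown space
    RAISED = 2       # safe clearance kept in every stage


def collision_safety(min_clearances, unknown_fraction, safe_clearance=0.02):
    """Classify a candidate from its per-stage minimum clearances.

    min_clearances: stage name -> minimum distance to obstacles in meters;
                    a value <= 0 means interference (assumed convention).
    unknown_fraction: share of the path lying in unobserved space.
    """
    worst = min(min_clearances.values())
    if worst <= 0.0:
        return Safety.ELIMINATED
    if worst < safe_clearance or unknown_fraction > 0.0:
        return Safety.LOWERED
    return Safety.RAISED
```

The worst stage decides the result, so a single interfering stage (for example, lifting) is enough to eliminate the candidate.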
[0135] For the robotic arm reachability scoring, the system calls the robotic arm inverse kinematics solution module to determine whether the robotic arm can reach the candidate grasping pose within the constraints of joint angle, end-effector pose, velocity, and singular configuration. If multiple feasible joint solutions exist, and the motion path is smooth with sufficient joint margin, the reachability level is increased; if only a critically feasible solution exists, or the robotic arm is close to joint limits, singular configurations, or motion space boundaries, the reachability level is decreased; if there is no inverse kinematics solution, the candidate grasping pose is eliminated.
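As an illustration of the reachability levels, the sketch below grades a set of inverse kinematics solutions by their distance from the joint limits. It assumes the solutions are already computed by an external solver, uses a hypothetical 10% margin ratio, and leaves out the velocity, path smoothness, and singular-configuration checks mentioned above.

```python
def reachability_level(ik_solutions, lower, upper, margin_ratio=0.1):
    """Return None (eliminate), 1 (level decreased), or 2 (level increased).

    ik_solutions: list of joint-angle tuples from an external IK solver.
    lower, upper: per-joint limits in the same units as the solutions.
    """
    if not ik_solutions:
        return None                       # no inverse kinematics solution

    def min_margin(q):
        # Smallest normalized distance of any joint to its nearest limit.
        return min(min(qi - lo, hi - qi) / (hi - lo)
                   for qi, lo, hi in zip(q, lower, upper))

    roomy = [q for q in ik_solutions if min_margin(q) >= margin_ratio]
    # Several solutions with comfortable joint margin raise the level;
    # a single or near-limit solution lowers it.
    return 2 if len(roomy) >= 2 else 1
```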
[0136] For gripping stability scoring, the system judges based on the contact point position of the candidate gripping pose, the point cloud distribution of the target object, the estimated center of gravity region of the target object, the local surface normal, the contact area, and the gripper closing direction. If the gripper contact point is distributed near the center of gravity of the target object, the contact surface is stable, the gripping direction can form a reliable grip, and the target object is not easy to flip or slip after gripping, the stability level is improved; if the gripper contact point is off-center, the contact area is too small, the surface curvature is too large, or the posture is easy to lose balance after gripping, the stability level is reduced.
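One ingredient of the stability judgment, the distance between the gripper contacts and the estimated center of gravity, can be sketched as follows. The point cloud mean is used here as a stand-in for the center of gravity region, and the linear score normalized by the object radius is an assumed formula; contact area, surface normals, and curvature are not modeled.

```python
import math


def centroid(points):
    """Mean of the object points, used as a proxy for the center of gravity."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def stability_score(points, contact_a, contact_b):
    """Score in [0, 1]: 1 when the contact midpoint sits on the centroid."""
    c = centroid(points)
    mid = tuple((a + b) / 2.0 for a, b in zip(contact_a, contact_b))
    radius = max(math.dist(c, p) for p in points)   # object extent
    if radius == 0.0:
        return 0.0
    return max(0.0, 1.0 - math.dist(c, mid) / radius)


square = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (2, 2, 0)]
centered = stability_score(square, (0, 1, 0), (2, 1, 0))   # through the middle
off_center = stability_score(square, (0, 0, 0), (2, 0, 0)) # along one edge
```

A grasp through the middle of the square scores higher than a grasp along its edge, matching the rule that off-center contacts reduce the stability level.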
[0137] For task constraint matching, the system reads the target part field, prohibited contact field, posture constraint field, and force control constraint field from the structured task representation. If the gripper contact area of the candidate grasping pose falls within the allowed contact area and does not pass through the prohibited contact area, and the posture of the grasped target object meets the user's instruction requirements, the task constraint matching level is increased; if the candidate grasping pose only meets some task constraints, its matching level is decreased; if the candidate grasping pose violates the prohibited contact constraint, such as the gripper contacting a blade, screen, button, or opening area, the candidate grasping pose is directly eliminated.
[0138] For the comprehensive uncertainty score, the system summarizes the uncertainty of target semantic recognition, mask boundary uncertainty, point cloud missing degree, occlusion degree, unknown space ratio, and candidate pose ranking stability. If a candidate grasping pose depends on a low-confidence visual region, an occluded inference region, or an unknown spatial path, its uncertainty level is increased, and its priority is reduced during comprehensive ranking; if the data source corresponding to the candidate grasping pose is stable, the target recognition is clear, the point cloud is complete, and the path space is sufficiently observed, its uncertainty level is reduced, and its execution priority is increased.
[0139] It should be noted that after completing the above processing, the system generates a comprehensive evaluation record for each candidate grasping pose. This comprehensive evaluation record includes basic grasping feasibility, target semantic consistency, perception reliability, spatial safety, motion accessibility, gripping stability, task constraint compliance, and uncertainty risk. The system sorts the candidate grasping poses according to a preset priority rule, resulting in a ranked list of candidate grasping poses.
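The embodiment does not spell out the preset priority rule, so the following sketch assumes one possible rule: a lexicographic ordering in which task constraint compliance and spatial safety are compared first, and the model confidence only breaks ties. The record fields and the order of the keys are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Comprehensive evaluation record (hypothetical integer levels)."""
    pose_id: int
    feasibility: float   # model output confidence
    semantic: int
    perception: int
    safety: int
    reachability: int
    stability: int
    task: int
    uncertainty: int     # higher means riskier


def rank(evaluations):
    """Sort best-first under an assumed lexicographic priority rule."""
    def key(e):
        return (e.task, e.safety, e.semantic, e.reachability,
                e.stability, e.perception, -e.uncertainty, e.feasibility)
    return sorted(evaluations, key=key, reverse=True)
```

With this rule a candidate with a high grasp confidence but a lower task constraint level still ranks below a fully compliant one.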
[0140] As a further optimized technical solution of this embodiment, it may also include data processing for re-observation triggering and re-observation pose determination. When the highest comprehensive score is lower than the preset execution threshold, or the target point cloud missing ratio is higher than the preset missing threshold, or the proportion of candidate grasping paths passing through unknown space is higher than the preset ratio, the re-observation pose is determined based on the benefits of mask uncertainty reduction, point cloud integrity improvement, grasping candidate discrimination improvement, and collision unknown space reduction. The end-effector camera of the robotic arm is then controlled to move to the re-observation pose to re-acquire the environmental RGB-D image.
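The three trigger conditions above combine with a logical OR. A minimal sketch follows, with threshold values chosen purely for illustration, since the embodiment refers only to preset thresholds.

```python
def needs_reobservation(best_score, missing_ratio, unknown_path_ratio,
                        exec_threshold=0.7, missing_threshold=0.3,
                        unknown_threshold=0.2):
    """Return True if any re-observation condition holds.

    All three default thresholds are hypothetical placeholder values.
    """
    return (best_score < exec_threshold
            or missing_ratio > missing_threshold
            or unknown_path_ratio > unknown_threshold)
```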
[0141] In the above steps, the system first reads the candidate grasping pose ranking list and determines whether the first-ranked candidate grasping pose meets the execution conditions. If the comprehensive evaluation record of the first-ranked candidate grasping pose shows insufficient perception reliability, severe point cloud missingness, insufficient collision margin, a high proportion of unknown space, or a low stability level, grasping is not executed immediately; instead, the system enters the re-observation judgment process.
[0142] The system establishes re-observation trigger markers during the re-observation judgment process. These re-observation trigger markers include at least mask uncertainty trigger markers, point cloud missing trigger markers, unknown space trigger markers, candidate pose proximity trigger markers, and task constraint conflict trigger markers.
[0143] 1. When the target object mask boundary is unstable, the reprojection deviation of the mask area is large under different viewpoints, or the segmentation boundary between the target object and adjacent objects is unclear, a mask uncertainty trigger marker is generated.
[0144] 2. When there are obvious gaps in the point cloud of the target object, the point cloud in the contact area is sparse, the area behind the target object is not observed, or the target object is occluded by other objects, a point cloud missing trigger marker is generated.
[0145] 3. When the candidate grasping path, gripper closing area, robotic arm approach path, or target object lifting path passes through unknown space, an unknown space trigger marker is generated.
[0146] 4. When the comprehensive evaluation results of multiple highly ranked candidate grasping poses are close, and the system cannot stably determine the optimal grasping pose, a candidate pose proximity trigger marker is generated.
[0147] 5. When the matching relationship between the candidate grasping pose and the target part, prohibited contact area, posture maintenance requirement, or force control requirement in the user instruction is uncertain, a task constraint conflict trigger marker is generated.
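The five trigger conditions listed above can be combined into a single bit-flag value. The sketch below is illustrative only: the scalar inputs (reprojection error in pixels, missing ratio, unknown-space ratio, top scores) and their thresholds are assumed summaries of the richer checks described in each item.

```python
from enum import Flag, auto


class Trigger(Flag):
    NONE = 0
    MASK_UNCERTAINTY = auto()
    POINT_CLOUD_MISSING = auto()
    UNKNOWN_SPACE = auto()
    CANDIDATE_PROXIMITY = auto()
    TASK_CONSTRAINT_CONFLICT = auto()


def collect_triggers(mask_reproj_error_px, cloud_missing_ratio,
                     path_unknown_ratio, top_scores,
                     constraint_match_uncertain,
                     max_reproj_error_px=5.0, max_missing_ratio=0.3,
                     min_score_gap=0.05):
    """Combine the five trigger conditions (thresholds are hypothetical)."""
    flags = Trigger.NONE
    if mask_reproj_error_px > max_reproj_error_px:
        flags |= Trigger.MASK_UNCERTAINTY
    if cloud_missing_ratio > max_missing_ratio:
        flags |= Trigger.POINT_CLOUD_MISSING
    if path_unknown_ratio > 0.0:
        flags |= Trigger.UNKNOWN_SPACE
    ranked = sorted(top_scores, reverse=True)
    # The two best candidates are too close to separate reliably.
    if len(ranked) >= 2 and ranked[0] - ranked[1] < min_score_gap:
        flags |= Trigger.CANDIDATE_PROXIMITY
    if constraint_match_uncertain:
        flags |= Trigger.TASK_CONSTRAINT_CONFLICT
    return flags
```

A flag set makes it straightforward for the later pre-evaluation to prioritize the benefit that corresponds to each raised marker.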
[0148] Specifically, after generating the re-observation trigger markers, the system generates multiple candidate re-observation poses based on the current position of the end-effector camera, the spatial position of the target object, the distribution of environmental obstacles, and the range of motion of the robotic arm. Each candidate re-observation pose includes the camera observation position, camera observation direction, robotic arm joint posture, expected observation area, and expected motion cost.
[0149] The system performs a data pre-evaluation for each candidate re-observation pose. For mask uncertainty reduction benefits, the system determines whether the re-observation pose can show the target object boundary, target parts, and separation region between the target and adjacent objects from a new viewing angle; poses that can observe more boundary details are marked as high mask improved poses.
[0150] For point cloud integrity improvement benefits, the system determines whether the re-observed pose can fill in the missing areas of the current point cloud, especially the candidate gripper contact area, the target object's back area, the occluded area, and the area related to center of gravity estimation; poses that can significantly fill in the three-dimensional shape of the target object are marked as high point cloud complete poses.
[0151] For the grasping candidate discrimination improvement benefit, the system determines whether the re-observation pose allows clearer observation of the key differences between multiple candidate grasping poses, such as whether the contact surface is flat, whether the gripper closing path is blocked, whether the target part is actually accessible, and whether the gripper will pass through the prohibited contact area. Observation poses that help the system clearly select the optimal candidate grasping pose are marked as high discrimination poses.
[0152] For the collision unknown space reduction benefit, the system determines whether the re-observation pose can observe the currently unknown space, especially the approach path of the gripper, the movement path of the robotic arm links, the lifting path of the target object, and the gaps between adjacent obstacles; observation poses that can convert unknown space into free space or occupied space are marked as high-safety supplementary poses.
[0153] Simultaneously, the system also determines whether the candidate re-observation pose is reachable, whether the movement is safe, whether it will cause the robotic arm to obstruct the target object, whether it will collide with the environment, and the distance and time required to move to that pose. If the candidate re-observation pose is unreachable or the movement risk is too high, it is discarded.
[0154] After completing the above evaluation, the system selects the pose with the best overall improvement and safest motion from the candidate re-observation poses as the target re-observation pose, and controls the end-effector camera of the robotic arm to move to this target re-observation pose. After reaching the target re-observation pose, the system re-acquires environmental RGB-D images, and performs time synchronization, coordinate registration, and point cloud fusion with the newly acquired images and historical images. Subsequently, it re-executes target segmentation, mask confidence update, point cloud confidence update, candidate grasping pose generation, and overall ranking.
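The selection of the target re-observation pose can be sketched as a filter followed by a utility maximization. The additive utility, the motion cost weight, and the dictionary keys are assumptions; the embodiment states only that the pose with the best overall improvement and safest motion is chosen and that unreachable or unsafe poses are discarded.

```python
def select_reobservation_pose(candidates, motion_weight=0.2):
    """Pick the feasible pose with the best gain-minus-cost utility.

    Each candidate is a dict with hypothetical keys: 'reachable', 'safe',
    'mask_gain', 'cloud_gain', 'discrimination_gain', 'unknown_gain',
    and 'motion_cost'. Returns None if no candidate is feasible.
    """
    feasible = [c for c in candidates if c["reachable"] and c["safe"]]
    if not feasible:
        return None

    def utility(c):
        gain = (c["mask_gain"] + c["cloud_gain"]
                + c["discrimination_gain"] + c["unknown_gain"])
        return gain - motion_weight * c["motion_cost"]

    return max(feasible, key=utility)
```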
[0155] Through the above processing, re-observation is not simply a matter of taking another picture; the system selects a new observation angle based on the weakest data link in the current grasping decision, so as to improve the data reliability of subsequent grasping decisions.
[0156] Further, the method includes data processing for optimal grasping pose execution and closed-loop correction.
[0157] When the highest comprehensive score meets the execution threshold, the candidate grasping pose with the highest comprehensive score is selected as the optimal grasping pose. The optimal grasping pose is then transformed into the robot's base coordinate system to generate a pre-grasping pose and a grasping execution path. After the robotic arm moves to the pre-grasping pose, it re-acquires local images of the target object using an end-effector depth camera. Based on the re-acquired local point cloud, the optimal grasping pose is corrected using a closed-loop method. Finally, the robotic arm is driven to perform the grasping operation through the ROS operating system.
[0158] In the above steps, the system first reads the candidate grasping pose with the highest ranking that meets the execution conditions from the candidate grasping pose sorting list, and determines it as the optimal grasping pose. The optimal grasping pose includes the gripper center point, gripper approach direction, gripper opening and closing direction, gripper width, grasping depth, and target contact area in the camera coordinate system.
[0159] The system transforms the optimal grasping pose from the camera coordinate system to the robot base coordinate system based on camera intrinsic and extrinsic parameters, hand-eye calibration parameters, and the robot's current joint state. After the transformation, the system generates grasping execution data in the robot base coordinate system. This grasping execution data includes the pre-grasping pose, approach pose, gripper closing position, target lifting pose, withdrawal pose, and the corresponding velocity, acceleration, and gripper opening / closing parameters for each stage.
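The camera-to-base transformation is a chain of homogeneous transforms. The sketch below assumes an eye-in-hand setup in which the base-to-end-effector transform comes from forward kinematics on the current joint state and the end-effector-to-camera transform comes from hand-eye calibration; the matrix names and numeric values are hypothetical.

```python
def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_point(t, p):
    """Apply rotation and translation of t to a 3D point."""
    return tuple(t[i][0] * p[0] + t[i][1] * p[1] + t[i][2] * p[2] + t[i][3]
                 for i in range(3))


def rotate_vector(t, v):
    """Apply only the rotation of t to a direction vector."""
    return tuple(t[i][0] * v[0] + t[i][1] * v[1] + t[i][2] * v[2]
                 for i in range(3))


# Hypothetical example values.
T_base_ee = [[1, 0, 0, 0.5], [0, 1, 0, 0.0], [0, 0, 1, 0.4], [0, 0, 0, 1]]
# Camera rotated 180 degrees about X so that its optical axis points down.
T_ee_cam = [[1, 0, 0, 0.0], [0, -1, 0, 0.0], [0, 0, -1, 0.05], [0, 0, 0, 1]]

T_base_cam = matmul(T_base_ee, T_ee_cam)
center_base = transform_point(T_base_cam, (0.1, 0.2, 0.3))
approach_base = rotate_vector(T_base_cam, (0.0, 0.0, 1.0))
```

Positions such as the gripper center point use the full transform, while directions such as the approach direction use only the rotation part.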
[0160] The pre-grasping pose is set on the opposite side of the approach direction of the optimal grasping pose, allowing the robotic arm to reach a safe waiting position before actually contacting the target object. When generating the pre-grasping pose, the system simultaneously checks whether the robotic arm can smoothly move from its current position to the pre-grasping pose, ensuring that the movement path avoids environmental obstacles and the robot itself.
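Placing the pre-grasping pose on the opposite side of the approach direction amounts to backing off from the grasp center along the reversed approach vector. A minimal sketch follows; the 10 cm standoff distance is an assumed value.

```python
import math


def pregrasp_position(center, approach, standoff=0.10):
    """Back off from the grasp center against the approach direction.

    center: grasp center in the robot base frame (meters).
    approach: approach direction (need not be normalized).
    standoff: hypothetical waiting distance in meters.
    """
    norm = math.sqrt(sum(c * c for c in approach))
    if norm == 0.0:
        raise ValueError("approach direction must be non-zero")
    return tuple(c - standoff * a / norm for c, a in zip(center, approach))
```

For a top-down grasp with approach direction (0, 0, -1), the pre-grasping position lies directly above the grasp center by the standoff distance.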
[0161] After the robotic arm moves to the pre-grasping pose, the system controls the end-effector depth camera to re-acquire local RGB-D images of the target object. At this time, because the end-effector camera is closer to the target object, it can obtain higher-precision local data on the target object's contact area, target parts, occlusion boundaries, and the gripper's approach path.
[0162] The system registers the newly acquired local point cloud at the pre-grasping pose with the target object point cloud previously used to generate the optimal grasping pose, and determines whether the target object has experienced positional shifts, pose changes, or occlusion changes during the robot's movement. If the positional and pose deviations between the local point cloud and the historical point cloud are within acceptable limits, the original optimal grasping pose is maintained, and the approach motion continues.
[0163] If the local point cloud shows that the target object has slightly shifted, but the target object remains visible and the gripper contact area still meets the task constraints, the system fine-tunes the optimal grasping pose based on the new local point cloud. This fine-tuning includes correcting the gripper center point position, approach direction, gripper opening / closing direction, and gripper width. After correction, the system re-checks whether the corrected grasping pose meets collision safety, robotic arm reachability, and task constraint requirements. If the local point cloud shows that the target object has shifted significantly, the target part is occluded, the prohibited contact area conflicts with the gripper path, or the corrected grasping pose fails to meet safety requirements, the system pauses the grasping execution and returns to the candidate grasping pose reordering process or the re-observation process.
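The keep / fine-tune / pause decision of the closed-loop correction can be sketched as below. The deviation inputs are assumed to come from registering the local point cloud against the historical one, and all four thresholds are hypothetical.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep original pose"
    REFINE = "fine-tune pose, then re-check safety"
    ABORT = "pause and re-rank or re-observe"


def correction_action(shift_m, rot_deg, target_visible, constraints_ok,
                      tol_shift=0.005, tol_rot=3.0,
                      max_shift=0.03, max_rot=15.0):
    """Decide how to react to the deviation measured at the pre-grasp pose."""
    if not target_visible or not constraints_ok:
        return Action.ABORT           # occluded part or constraint conflict
    if shift_m <= tol_shift and rot_deg <= tol_rot:
        return Action.KEEP            # deviation within acceptable limits
    if shift_m <= max_shift and rot_deg <= max_rot:
        return Action.REFINE          # slight shift: correct the pose
    return Action.ABORT               # significant shift
```

After a `REFINE` result, the corrected pose would still need the collision, reachability, and task constraint re-checks described above before the approach continues.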
[0164] After confirming that the optimal grasping pose is valid, the system issues motion control commands to the robotic arm controller via the ROS operating system. These commands include robotic arm joint trajectory commands, end-effector pose control commands, gripper opening and closing control commands, and execution status feedback subscription commands. The robotic arm executes the grasping operation in the sequence of pre-grasping, approaching, closing the gripper, lifting the target object, and withdrawing. During gripper closing, the system reads gripper current, gripper displacement, gripping force, end-effector torque, or tactile feedback data to determine whether the target object is effectively grasped. If the gripper closes to the predetermined position but no gripping resistance is detected, the grasp is considered an empty grasp; if the target object slips after grasping, it is considered an unstable grasp; if the gripping force exceeds the allowable range of the target object, it is considered an over-gripping grasp. The system updates the historical execution records of the target object and similar grasping poses based on the above execution feedback, providing feedback data for subsequent grasping pose scoring.
[0165] In this embodiment, through the above-mentioned closed-loop correction process, the system can eliminate, before execution, the pose deviation caused by visual errors, robotic arm motion errors, slight movement of the target object, or changes in local occlusion, thereby improving the actual grasping success rate and reducing the risks of collision, empty grasps, mis-grasps, and slippage after grasping.
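The feedback rules applied during gripper closing can be sketched as a small classifier. The inputs are assumed to be already derived from the raw gripper current, displacement, force, or tactile signals, and the force limits are hypothetical values that would in practice depend on the target object.

```python
from enum import Enum


class GraspResult(Enum):
    SUCCESS = "effective grasp"
    EMPTY = "closed with no resistance"
    UNSTABLE = "object slipped after grasping"
    OVER_GRIP = "force above the allowable range"


def classify_grasp(reached_closed_position, grip_force_n, slipped,
                   min_force_n=0.5, max_force_n=20.0):
    """Classify execution feedback (force limits are placeholder values)."""
    if reached_closed_position and grip_force_n < min_force_n:
        return GraspResult.EMPTY
    if grip_force_n > max_force_n:
        return GraspResult.OVER_GRIP
    if slipped:
        return GraspResult.UNSTABLE
    return GraspResult.SUCCESS
```

The returned label is the kind of execution feedback that could be appended to the historical records used in later grasping pose scoring.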
[0166] This invention transforms language parsing results into task constraints that can participate in geometric judgments, and incorporates semantic segmentation confidence, point cloud uncertainty, centroid shift, collision margin, and robotic arm reachability into the grasping decision process. Therefore, even in situations involving cluttered stacking, target occlusion, unregistered target categories, and natural language instructions containing part numbers or avoidance requirements, it can reduce the probability of target misidentification, background point intrusion, collision grasping, and unstable grasping.
[0167] Meanwhile, active re-observation and pre-grasp closed-loop correction enable the system to update the target point cloud and grasp pose based on real-time visual feedback before grasping, reducing the impact of one-time visual perception errors on the final execution result and improving the robot's grasping success rate, operational safety and task adaptability in open environments.
[0168] Example 3:
[0169] Figure 3 is a structural block diagram of the local terminal of an exemplary electronic device (machine) of the present invention. As shown in the figure, the electronic device of the present invention includes a processor 11, a memory 12, a storage space 13 for storing program code, and program code 14 for executing the method steps according to the present invention. The program code 14 is used to execute the above-described control logic.
[0170] Figure 4 is a structural block diagram of the network end of an exemplary electronic device of the present invention. As shown in Figure 4, the present invention also provides an electronic device (machine), which may include at least one processor 210, at least one memory 230 communicatively connected to the processor, and a communication bus 240 and a communication interface 220 connecting different system components (including the memory 230 and the processor 210). The processor 210, the memory 230, and the communication interface 220 are connected through the communication bus 240 and communicate with each other. The communication interface 220 is used for data interaction with external devices. The memory 230 stores a machine-executable program that can be executed by the processor, and the processor 210 can execute the above-mentioned control logic by calling the machine-executable program.
[0171] Communication bus 240 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. Examples of these architectures include, but are not limited to, Industry Standard Architecture (ISA) buses, Micro Channel Architecture (MCA) buses, Enhanced ISA buses, Video Electronics Standards Association (VESA) local buses, and Peripheral Component Interconnect (PCI) buses.
[0172] Electronic devices typically include a variety of computer system readable media, which can be any available media that can be accessed by the electronic device, including volatile and non-volatile media, and removable and non-removable media.
[0173] Memory 230 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) and / or cache memory. The electronic device may further include other removable / non-removable, volatile / non-volatile computer system storage media. Memory 230 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the control logic described above.
[0174] A program / utility having a set (at least one) of program modules can be stored in memory 230. Such program modules include, but are not limited to, an operating system, one or more applications, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment.
[0175] Machine-executable programs for performing this invention can be written in one or more programming languages or a combination thereof. These programming languages include object-oriented programming languages such as Java, C++, and Python, and may also include specialized engineering languages such as R. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0176] The present invention also discloses a storage medium storing a computer program, namely the machine-executable program described above, which, when executed, implements the VLA-based robot grasping method.
[0177] The aforementioned storage medium may be any combination of one or more computer-readable media. Computer-readable media may be, for example, computer-readable signal media or computer-readable storage media. Computer-readable storage media include, but are not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof. More specific examples of computer-readable storage media (a non-exhaustive list) include: electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or flash memory, optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this invention, a computer-readable storage medium may be, for example, any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
[0178] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media may also be any computer-readable medium other than computer-readable storage media, capable of sending, propagating, or transmitting programs for use by or in connection with an instruction execution system, apparatus, or device.
[0179] Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.
[0180] This invention is not limited to the embodiments described above. Any changes in shape or structure shall fall within the protection scope of this invention. The protection scope of this invention is defined by the appended claims. Those skilled in the art may make various changes or modifications to these embodiments without departing from the principles and essence of this invention, but all such changes and modifications shall fall within the protection scope of this invention.
Claims
1. A VLA-based robot manipulation grasping method, characterized in that the method includes the following steps:
S1. Language instruction parsing: receive a natural language grasping instruction from the user, parse the instruction using a vision-language-action large model, extract the semantic description, operational constraints, and task intent of the target object, and generate a structured task representation.
S2. Visual perception and target segmentation: acquire environmental images through depth cameras installed at the end of the robotic arm and on the robot body; input the language prompts into an open-vocabulary visual segmentation model to perform target detection and semantic segmentation on the environmental images and obtain the target object mask region corresponding to the language instruction.
S3. Target point cloud generation and grasp detection: based on the semantic segmentation mask, extract the target object point cloud from the depth image, input the target point cloud into a grasp detection model, generate multiple candidate grasping poses, and calculate the grasp score for each candidate grasping pose.
S4. Grasping stability assessment and collision detection: perform collision detection on the generated candidate grasping poses to eliminate those that pose a risk of collision with the environment or the robot itself; at the same time, score the stability of the remaining candidate grasping poses based on the object's center of gravity distribution, and rank them according to the comprehensive score.
S5. Optimal grasping pose selection and execution: select the grasping pose with the highest score, convert it to the robot's base coordinate system, generate the corresponding motion control command, and drive the robotic arm to perform the grasping operation through the ROS operating system.
2. The VLA-based robot manipulation grasping method according to claim 1, characterized in that S1 specifically includes:
S11. Receive the natural language grasping instruction input by the user, and preprocess the instruction to remove irrelevant modifiers while preserving semantic integrity.
S12. Input the preprocessed natural language grasping instruction into the vision-language-action large model for semantic parsing, extract the semantic description of the target object, the operational constraints, and the task intent, and generate a structured task representation to guide subsequent visual perception and grasp planning.
3. The VLA-based robot manipulation grasping method according to claim 1, characterized in that S2 specifically includes:
S21. Acquire color and depth image data of the working environment through the depth cameras installed at the end of the robotic arm and on the robot body.
S22. Input the language prompts in the structured task representation generated in S1 into the open-vocabulary visual segmentation model to perform target detection on the environmental image and obtain the target candidate regions corresponding to the language semantics.
S23. Generate a refined semantic segmentation mask based on the detected target candidate region to obtain the target object mask region consistent with the natural language grasping instruction.
4. The VLA-based robot manipulation grasping method according to claim 3, characterized in that S23 specifically includes:
S231. Image feature extraction: the segmentation model encodes the input image $I$ using a hierarchical visual Transformer to generate multi-scale visual features: $\{F_l\} = E_{img}(I)$, $F_l \in \mathbb{R}^{H_l \times W_l \times C_l}$; wherein $E_{img}$ is the hierarchical image encoder; $F_l$ is the image semantic feature of the $l$-th layer; $H_l$, $W_l$, and $C_l$ are the height, width, and number of channels of the $l$-th layer feature map; and $\{F_l\}$ represents the feature maps of the different levels.
S232. Prompt encoding: the target candidate region $B$ obtained in S22 is input as the segmentation prompt into the prompt encoder of the segmentation model, which converts $B$ into a prompt feature: $P = E_{prompt}(B)$.
S233. Mask prediction: the prompt feature and the image features are input into the attention fusion module of the segmentation model, whose core calculation is: $F_{fuse} = \mathrm{softmax}\left(\frac{(P W_q) K^{\top}}{\sqrt{d}}\right) V$; wherein $W_q$ is a learnable weight matrix and $P$ is the prompt feature; $K$ and $V$ are the keys and values generated from the image features and the model's internal memory; $\mathrm{softmax}$ is the normalized activation function; and $d$ is the feature dimension. The fused feature $F_{fuse}$ is passed through the mask decoder to generate a probability mask of the target region: $\hat{M} = \sigma(D_{mask}(F_{fuse}))$; wherein $\hat{M}$ is the predicted candidate target mask to be screened, $\sigma$ is the sigmoid activation function, $D_{mask}$ is the mask decoder, and $F_{fuse}$ is the fused feature.
S234. Optimal mask output: S233 generates a plurality of mask candidates with corresponding quality scores, each quality score being calculated from the global representation of the mask: $s_k = w^{\top} g(\hat{M}_k)$; wherein $w$ is a learnable weight and $g(\hat{M}_k)$ is the global representation of the $k$-th mask. The mask with the highest score is chosen as the final output mask: $M^{*} = \hat{M}_{k^{*}}$ with $k^{*} = \arg\max_k s_k$; wherein $\hat{M}_1, \dots, \hat{M}_K$ are the plurality of candidate probability masks.
5. The VLA-based robot manipulation grasping method according to claim 1, characterized in that S3 specifically includes:
S31. Based on the semantic segmentation mask of the target object obtained in S2, extract the corresponding depth information from the depth image and back-project it into three-dimensional space to generate the point cloud of the target object.
S32. Input the target object point cloud into the point cloud-based grasp detection model, and perform feature extraction and spatial analysis on the target point cloud.
S33. Output multiple candidate grasping poses based on the grasp detection model, and calculate the corresponding grasp score for each candidate grasping pose.
6. The VLA-based robot manipulation grasping method according to claim 5, characterized in that S32 specifically includes:
S321. Point cloud geometric feature extraction: based on the target three-dimensional point cloud $X$ generated in S31, the grasp detection model extracts features from the local curvature, surface normals, geometric structure, and overall spatial relationships of the point cloud; the encoding process forms a multi-scale semantic feature representation of the point cloud: $F_{pc} = E_{pc}(X)$; wherein $E_{pc}$ is the point cloud geometric feature encoder of the grasp detection model, and $F_{pc}$ is the high-dimensional feature set of the point cloud, containing local and global spatial semantic information.
S322. Spatial analysis: based on the point cloud features $F_{pc}$ from S321, the grasp detection model performs grasp-related spatial geometric analysis on the point cloud to predict the contact feasibility and grasping depth on the object surface. The grasping depth is calculated as: $z_g = D(u, v) + \Delta d$; wherein $z_g$ is the predicted depth of the grasp center point in the camera coordinate system; $D(u, v)$ is the depth of the corresponding pixel in the point cloud; $(u, v)$ are the pixel coordinates on the image plane, with $u$ in the horizontal direction and $v$ in the vertical direction; and $\Delta d$ is the depth of the grasping pose relative to the object's surface.
7. The VLA-based robot manipulation grasping method according to claim 1, characterized in that S4 specifically includes:
S41. Simplify the gripper into three cuboids. Based on the gripper geometry model and the environment point cloud model, perform collision detection on the candidate grasping poses and eliminate grasping poses that have a risk of collision with the environment or the robot body.
S42. Estimate the geometric center or centroid position of the target object based on the point cloud of the target object, and combine it with the gripper contact plane to score the grasping stability of the remaining candidate grasping poses.
S43. Based on the collision detection results and the grasping stability scores, comprehensively rank the candidate grasping poses.
8. The VLA-based robot manipulation grasping method according to claim 1, characterized in that S5 specifically includes:
S51. Select the grasping pose with the highest comprehensive score from the ranked candidate grasping poses as the optimal grasping pose.
S52. Transform the optimal grasping pose from the camera coordinate system to the robot base coordinate system, and generate the corresponding robotic arm motion control command.
S53. Issue the motion control command through the ROS operating system, start the corresponding node, and drive the robotic arm to complete the grasping operation on the target object.
9. A VLA-based robot manipulation grasping system, characterized in that it includes: one or more processors; and a memory used to store one or more programs; wherein, when the one or more programs are executed by the one or more processors, the system performs the VLA-based robot manipulation grasping method as described in any one of claims 1-8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the program implements the VLA-based robot manipulation grasping method as described in any one of claims 1-8.