Method and system for robotic grasping enabled by zero-shot shape reconstruction
By using an octree-based machine learning model together with contact constraints and collision detection, the invention addresses zero-shot shape reconstruction for robotic grasping, providing efficient 3D shape reconstruction and stable grasping pose prediction suitable for robotic grasping tasks in complex environments.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: TOYOTA RESEARCH INSTITUTE INC
- Filing Date: 2025-12-12
- Publication Date: 2026-06-12
AI Technical Summary
Existing technologies lack an effective zero-shot shape reconstruction method for robotic grasping scenarios, which leads to accidental collisions and unstable contact. Furthermore, multi-view reconstruction has high computational overhead and makes it difficult to accurately reconstruct 3D shapes within confined spaces.
An octree-based machine learning model is employed, combining a conditional variational autoencoder with a multi-object encoder. 3D reconstruction and grasping pose prediction are performed together, and the predicted poses are refined using contact constraints and collision detection. A 3D occlusion field is determined by ray casting, enabling efficient zero-shot shape reconstruction.
The approach achieves efficient 3D shape reconstruction and grasping pose prediction without prior knowledge of the objects, improves reconstruction quality and grasping stability, reduces computational overhead, and is suitable for robotic grasping tasks in complex environments.
Figure CN122185154A_ABST
Abstract
Description
Technical Field
[0001] This specification relates to robotic grasping, and more particularly to a method and system for robotic grasping enabled by zero-shot shape reconstruction.
Background
[0002] To enable a robot to grasp objects in a scene, the robot can determine grasping poses that indicate how each object should be grasped. Robust robotic grasping may require an accurate geometric understanding of the target object and its surrounding environment. Without explicitly modeling the geometry of the target object, accidental collisions and unstable contact with the target object can occur. Pre-reconstructing the target object from multi-view images can introduce additional computational overhead and may require a more complex setup. Multi-view reconstruction may also be impractical for objects placed in confined spaces such as shelves or boxes. Furthermore, the lack of large-scale datasets with ground-truth 3D shape and grasping pose annotations further complicates accurate 3D reconstruction from a single RGB-D image. In some instances, particularly for regression-based zero-shot 3D reconstruction, sparse voxel representations may outperform volumetric and NeRF-like implicit shape representations in terms of runtime, accuracy, and resolution. Thus, there is a need for improved methods and systems for zero-shot shape reconstruction-enabled robotic grasping.
Summary of the Invention
[0003] In one embodiment, a method may include receiving training data, the training data including multiple images containing one or more objects, multiple depth maps associated with the multiple images, and ground-truth data associated with the multiple images. The ground-truth data may include shapes and grasping poses associated with the one or more objects in the multiple images. The method may further include training a machine learning model using the training data to receive a first image containing one or more first objects and a first depth map associated with the first image, and to output a first shape and a first grasping pose of the one or more first objects. The machine learning model may include a conditional variational autoencoder, a multi-object encoder for encoding multi-object reasoning associated with objects, and a three-dimensional occlusion field determined by ray casting.
[0004] In another embodiment, a computing device may include one or more processors configured to receive training data, the training data including multiple images containing one or more objects, multiple depth maps associated with the multiple images, and ground-truth data associated with the multiple images. The ground-truth data may include shapes and grasping poses associated with the objects in the multiple images. The one or more processors may be further configured to train a machine learning model using the training data to receive a first image containing one or more first objects and a first depth map associated with the first image, and to output a first shape and a first grasping pose of the one or more first objects. The machine learning model may include a conditional variational autoencoder, a multi-object encoder for encoding multi-object reasoning associated with objects, and a three-dimensional occlusion field determined by ray casting.
[0005] In another embodiment, a non-transitory computer-readable storage medium may include a memory storing a program that, when executed by a processor, causes the processor to receive training data, the training data including multiple images containing one or more objects, multiple depth maps associated with the multiple images, and ground-truth data associated with the multiple images. The ground-truth data may include shapes and grasping poses associated with the objects in the multiple images. The program may further cause the processor to train a machine learning model using the training data to receive a first image containing one or more first objects and a first depth map associated with the first image, and to output a first shape and a first grasping pose of the one or more first objects. The machine learning model may include a conditional variational autoencoder, a multi-object encoder for encoding multi-object reasoning associated with objects, and a three-dimensional occlusion field determined by ray casting.
Brief Description of the Drawings
[0006] The embodiments illustrated in the accompanying drawings are illustrative and exemplary in nature and are not intended to limit this disclosure. The following detailed description of the illustrative embodiments will be understood when read in conjunction with the following drawings, wherein like structures are indicated by like reference numerals, and wherein:
[0007] Figure 1 schematically depicts the architecture of a machine learning model for zero-shot shape reconstruction-enabled robotic grasping, according to one or more embodiments shown and described herein;
[0008] Figure 2 depicts a schematic diagram of a computing device for implementing the machine learning model of Figure 1, according to one or more embodiments shown and described herein;
[0009] Figure 3 illustrates an exemplary instance mask and occlusion field, according to one or more embodiments shown and described herein;
[0010] Figure 4 illustrates exemplary grasping poses, according to one or more embodiments shown and described herein;
[0011] Figure 5 depicts a flowchart of an exemplary method of operating the computing device of Figure 2 to train a machine learning model, according to one or more embodiments shown and described herein; and
[0012] Figure 6 depicts a flowchart of an exemplary method of operating the computing device of Figure 2 after the machine learning model has been trained, according to one or more embodiments shown and described herein.
Detailed Description
[0013] The embodiments disclosed herein provide a novel framework for near real-time 3D reconstruction and 6D grasping pose prediction. The embodiments disclosed herein enhance grasping pose prediction by leveraging physics-based contact constraints and collision detection. Since robotic environments typically involve multiple objects with inter-object occlusion and close contact, the embodiments disclosed herein include a multi-object encoder and a 3D occlusion field. These components effectively model inter-object relationships and occlusion, thereby improving reconstruction quality. Additionally, the embodiments disclosed herein utilize a refinement algorithm to improve the grasping pose using the predicted reconstruction. The reconstruction generated by the embodiments disclosed herein provides reliable contact points and collision masks between the gripper (e.g., a robotic arm) and the target object, which can be used to refine the grasping pose.
[0014] In the embodiments disclosed herein, a machine learning model can be trained to receive an input image and a depth map associated with the image. The image may include one or more objects. The machine learning model can be trained to output a grasping pose for the objects in the image. In particular, the machine learning model can be trained to simultaneously perform a 3D reconstruction of a scene captured by the image and predict a grasping pose for the objects in the image. Thus, after the machine learning model is trained, it can be used by a robotic arm or other gripper to grasp real-world objects. For example, a robotic arm can capture an image and a depth map of a scene containing one or more objects. The image can be input into a trained machine learning model, which can output a grasping pose for the objects. The robotic arm can then grasp and manipulate one or more of the objects based on the output grasping pose.
[0015] Known grasping pose prediction methods typically assume prior knowledge of the 3D object and rely on simplified analytical models based on the force closure principle. However, the embodiments disclosed herein allow for zero-shot robotic grasping, referring to the ability to grasp unseen target objects without prior knowledge. In particular, the embodiments disclosed herein describe an efficient and general model for simultaneously performing 3D shape reconstruction and grasping pose prediction from a single RGB-D observation. The predicted reconstruction can be used to refine the grasping pose via contact-based constraints and collision detection.
[0016] In this embodiment, an octree is used as a shape representation, where attributes such as image features, a signed distance function (SDF), normal vectors on the object surface (referred to herein as normals), and grasping poses are defined at the deepest level of the octree. In one example, the input octree can be represented as a tuple of voxel centers p at the final depth associated with image features f.
[0017] $$x = (p, f), \quad p \in \mathbb{R}^{N \times 3} \tag{1}$$
[0018] where N is the number of voxels and f holds one image feature vector per voxel. Unlike point clouds, the octree structure enables efficient depth-first search and recursive subdivision into octants, making it well suited to high-resolution shape reconstruction and dense grasping pose prediction in a memory- and computation-efficient manner.
[0019] In this embodiment, the grasping pose can be represented using a general two-finger parallel gripper model. Figure 4 shows an exemplary two-finger parallel gripper 400 with fingers 402 and 404. In an embodiment, the grasping pose may include the following components: graspness $v$, which indicates the robustness of the grasping position; quality $q$, which can be calculated using a force closure algorithm; approach vector $a$; tangent vector $t$; width $w$; and depth $d$:
[0020] $$g = [v \;\; q \;\; a \;\; t \;\; w \;\; d] \tag{2}$$
[0021] Each component is defined per voxel, where M denotes the number of voxels in the target octree. The nearest grasping pose within a 5 mm radius is assigned to each point; if no such grasping pose exists, the corresponding graspness is set to 0. In an embodiment, the rotation matrix can be recovered from the approach vector and the tangent vector using Gram-Schmidt orthogonalization. The rotation matrix can be defined in the gripper coordinate system. The target octree can be defined using the grasping pose $g$:
[0022] $$y = (p^{gt}, f^{gt}) = (p^{gt}, [s \;\; n \;\; g]) \tag{3}$$
[0023] where $s$ is the SDF and $n$ is the normal vector of the target octree.
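The Gram-Schmidt step mentioned in paragraph [0021] can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the function name and the column order of the resulting matrix (approach, tangent, then their cross product) are hypothetical, since the text does not specify the axis convention of the gripper coordinate system.

```python
import numpy as np

def rotation_from_approach_tangent(approach, tangent):
    """Build a 3x3 rotation matrix from an approach and a tangent vector.

    Assumption: columns are ordered (approach, tangent, cross product).
    The source does not state the gripper-frame axis convention.
    """
    a = np.asarray(approach, dtype=float)
    t = np.asarray(tangent, dtype=float)
    # Step 1: normalize the approach vector.
    x = a / np.linalg.norm(a)
    # Step 2: remove the component of the tangent along x, then normalize.
    t_orth = t - np.dot(t, x) * x
    y = t_orth / np.linalg.norm(t_orth)
    # Step 3: complete a right-handed frame with the cross product.
    z = np.cross(x, y)
    return np.stack([x, y, z], axis=1)

# A tangent prediction that is not exactly perpendicular to the approach vector.
R = rotation_from_approach_tangent([0.0, 0.0, 2.0], [1.0, 0.0, 0.5])
```

Because the second vector is re-orthogonalized against the first, the result is a valid rotation even when the two predicted vectors are not exactly perpendicular.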
[0024] Turning now to the drawings, Figure 1 illustrates an exemplary architecture of the machine learning model 100 disclosed herein. As described above, the machine learning model 100 can be trained to receive an input RGB-D image and output predicted grasping poses for objects in the image. In particular, given an input octree x composed of the partial point cloud of each instance, derived from a depth map and an instance mask, along with the corresponding image features, the machine learning model 100 predicts a 3D reconstruction and grasping poses represented as an octree. As disclosed herein, the machine learning model 100 is built on an octree-based U-Net and a conditional variational autoencoder (CVAE) to model uncertainty in shape reconstruction and grasping pose prediction while maintaining near real-time inference. The components of machine learning model 100 are discussed in more detail below in conjunction with Figure 2.
[0025] Figure 2 depicts a computing device 200 for performing zero-shot shape reconstruction-enabled robotic grasping as disclosed herein. In particular, the computing device 200 can be used to train the machine learning model 100 of Figure 1 and to use the machine learning model 100 after it has been trained.
[0026] In the example of Figure 2, computing device 200 includes one or more processors 202, one or more memory modules 204, network interface hardware 206, and communication path 208. The one or more processors 202 may be a controller, integrated circuit, microchip, computer, or any other computing device. The one or more memory modules 204 may include RAM, ROM, flash memory, a hard disk drive, or any device capable of storing machine-readable and executable instructions such that the instructions can be accessed by the one or more processors 202.
[0027] Network interface hardware 206 can be communicatively coupled to communication path 208 and can be any device capable of transmitting and/or receiving data via a network. Accordingly, network interface hardware 206 may include a communication transceiver for sending and/or receiving any wired or wireless communication. For example, network interface hardware 206 may include an antenna, a modem, a LAN port, a Wi-Fi card, a WiMax card, mobile communication hardware, near-field communication hardware, satellite communication hardware, and/or any wired or wireless hardware for communicating with other networks and/or devices. In one embodiment, network interface hardware 206 includes hardware configured to operate in accordance with the Bluetooth wireless communication protocol. The network interface hardware 206 of the computing device 200 can receive images captured by one or more cameras, as disclosed in more detail below.
[0028] The one or more memory modules 204 include a database 212, an image receiving module 214, a training data receiving module 216, an image encoder module 218, an instance mask module 220, a back projection module 222, an octree conversion module 223, a prior octree encoder module 224, a posterior octree encoder module 226, a decoder module 228, a multi-object encoder module 230, a 3D occlusion field module 232, a training module 234, an inference module 236, and a grasping pose refinement module 238. Each of these may be a program module in the form of an operating system, application program module, or other program module stored in the one or more memory modules 204. In some embodiments, the program modules can be stored in a remote storage device capable of communicating with the computing device 200. Such program modules can include, but are not limited to, routines, subroutines, programs, objects, components, data structures, and the like for performing specific tasks or implementing specific data types, as described below.
[0029] Database 212 can store image data, depth map data, and training data used to train machine learning model 100, as disclosed herein. Database 212 can also store parameters of machine learning model 100 during training.
[0030] Still referring to Figure 2, the image receiving module 214 can receive images and depth maps (e.g., RGB-D images) of a scene containing one or more objects. The received images can be input into the machine learning model 100 after it has been trained, which can output a predicted grasping pose, as disclosed in more detail herein. The predicted grasping pose can be used to grasp and manipulate objects using a robotic arm or other gripper.
[0031] Still referring to Figure 2, the training data receiving module 216 can receive training data that can be used to train the machine learning model 100, as disclosed in more detail herein. In embodiments, the training data received by the training data receiving module 216 may include multiple images (each containing one or more objects), depth maps associated with the images, and ground-truth octree data associated with the images. The ground-truth octree data may include a grasping pose for each object in the image, a normal for each object in the image, and an SDF for each object in the image, such as the target octree y 128 shown in Figure 1.
[0032] Still referring to Figure 2, the image encoder module 218 can encode images received by the image receiving module 214 and/or the training data receiving module 216 to generate features. In particular, an RGB image can be encoded to extract image features $W$. For example, as shown in Figure 1, an exemplary image 102 can be input to an image encoder 106, which encodes the image 102 to generate image features. The image encoder 106 of Figure 1 can be implemented by the image encoder module 218 of Figure 2. As shown in Figure 1, the image features generated by the image encoder module 218 can be included in the input octree x that is input into the machine learning model 100.
[0033] Referring back to Figure 2, the instance mask module 220 can identify objects in images received by the image receiving module 214 or the training data receiving module 216, and can generate a two-dimensional instance mask for each identified object, where the instance mask $M_i$ represents the mask of the i-th object. Figure 3 shows a scene 300 containing objects 302 and 304. In the example of Figure 3, object 304 occludes object 302. Figure 3 also shows an exemplary two-dimensional instance mask 306 that can be generated for scene 300 by the instance mask module 220. In particular, instance mask 306 includes two-dimensional projections of objects 302 and 304. Referring back to Figure 1, the instance mask M 114 is applied to the input octree x 110 and the 3D occlusion field V 122. The instance mask M 114 can be generated by the instance mask module 220 of Figure 2.
[0034] Referring back to Figure 2, for each object identified by the instance mask module 220, the back projection module 222 can back-project the image features generated by the image encoder module 218 into three-dimensional space. Specifically, the back projection module 222 can back-project the image features as $(q_i, w_i) = \pi^{-1}(W, D, K, M_i)$, where $q_i$ and $w_i$ denote the 3D point cloud of the i-th object and its corresponding features, respectively. Here, $\pi^{-1}$ is the back projection function shown at 108 in Figure 1, $D$ is the depth map, and $K$ is the intrinsics of the camera that captured the image. In the example of Figure 1, an exemplary depth map 104 corresponding to exemplary image 102 is shown.
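As a minimal sketch of the back projection in paragraph [0034], the following NumPy function lifts the masked pixels of one instance to 3D points and gathers their features. It assumes a pinhole camera model, a feature map at the same resolution as the depth map, and a depth value of 0 for missing readings; none of these details are stated in the text, and the function name is hypothetical.

```python
import numpy as np

def backproject(features, depth, K, mask):
    """Lift masked pixels to 3D points together with their image features.

    features: (rows, cols, C) image feature map (assumed at image resolution)
    depth:    (rows, cols) depth map
    K:        (3, 3) pinhole camera intrinsics
    mask:     (rows, cols) boolean instance mask of the i-th object
    Returns points of shape (n, 3) and features of shape (n, C).
    """
    valid = mask & (depth > 0)            # assumption: 0 marks a missing depth reading
    v, u = np.nonzero(valid)              # pixel rows (v) and columns (u)
    z = depth[v, u]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx                 # invert u = fx * x / z + cx
    y = (v - cy) * z / fy                 # invert v = fy * y / z + cy
    points = np.stack([x, y, z], axis=1)
    return points, features[v, u]

K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
features = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
mask = np.zeros((4, 4), dtype=bool)
mask[2, 3] = True
q_i, w_i = backproject(features, depth, K, mask)
```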
[0035] Referring back to Figure 2, the octree conversion module 223 can convert the 3D point cloud and features generated by the back projection module 222 into an octree. Specifically, the octree conversion module 223 can compute $x_i = (p_i, f_i) = G(q_i, w_i)$, where $G$ is the conversion function from a point cloud and its features to an octree.
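The conversion function G is not detailed in the text. One plausible simplification, shown below purely as an assumption, keeps only the leaf level of the octree: points are quantized to a grid with 2^depth cells per axis, and the features of points that share a cell are averaged. The function name, the bounding-box parameters, and the averaging rule are all hypothetical.

```python
import numpy as np

def points_to_leaf_voxels(points, feats, depth, bbox_min, bbox_size):
    """Quantize a point cloud to the leaf grid of an octree of the given depth.

    Returns voxel centers (n_vox, 3) and per-voxel mean features (n_vox, C).
    """
    res = 2 ** depth                               # cells per axis at the final depth
    cell = bbox_size / res
    idx = np.floor((points - bbox_min) / cell).astype(np.int64)
    idx = np.clip(idx, 0, res - 1)
    # Merge points that fall into the same voxel.
    uniq, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                  # some NumPy versions return a 2D inverse
    sums = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(sums, inverse, feats)                # unbuffered scatter-add of features
    counts = np.bincount(inverse, minlength=len(uniq))[:, None]
    centers = bbox_min + (uniq + 0.5) * cell
    return centers, sums / counts

pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.9, 0.9, 0.9]])
fts = np.array([[1.0], [3.0], [5.0]])
p_i, f_i = points_to_leaf_voxels(pts, fts, depth=1, bbox_min=np.zeros(3), bbox_size=1.0)
```

A full octree would also store the occupied cells at every coarser depth; these can be obtained by repeatedly halving the integer indices.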
[0036] Referring back to Figure 1, to improve shape reconstruction quality, machine learning model 100 uses probabilistic modeling via an octree-based conditional variational autoencoder (CVAE) 101 to address the inherent uncertainty of single-view shape reconstruction, which is crucial for improving both reconstruction and grasping pose prediction quality. In the example of Figure 1, the octree-based CVAE 101 includes a posterior encoder, a prior encoder, and a decoder, which jointly learn a latent representation of the 3D shape and the grasping pose as a diagonal Gaussian.
[0037] In the embodiment, the posterior encoder $\mathcal{E}(z_i \mid x_i, y_i)$ learns to predict the latent code $z_i$ 116 from the octree $x_i$ and the ground-truth octree $y_i$, as shown in Figure 1. The latent code $z_i$ 116 can be projected into a lower-dimensional space to generate latent features. In particular, the prior encoder takes the octree $x_i$ as input and computes the latent features and the latent code, where $N_i'$ and $D'$ denote the number of points and the dimension of the latent features. Internally, the latent code is sampled from the predicted mean and variance via reparameterization. The decoder predicts the 3D reconstruction along with the grasping pose. To reduce computational cost, decoder 126 can predict occupancy at each depth and discard grid cells with an occupancy probability below 0.5. Only at the last layer does the decoder predict the SDF, normal vectors, grasping poses, and occupancy. During training, the KL divergence between the posterior encoder and the prior encoder is minimized so that their distributions match. Referring back to Figure 2, the prior octree encoder module 224 can implement the prior encoder of Figure 1, the posterior octree encoder module 226 can implement the posterior encoder of Figure 1, and the decoder module 228 can implement the decoder of Figure 1.
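The two probabilistic ingredients named in paragraph [0037], sampling via reparameterization and the KL divergence between two diagonal Gaussians, have standard closed forms. The sketch below shows them in NumPy; the function names and the log-variance parameterization are assumptions made for illustration, not details taken from the text.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps with eps drawn from N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between two diagonal Gaussians, summed over all dimensions."""
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    return 0.5 * np.sum(
        log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

rng = np.random.default_rng(0)
mu = np.zeros(8)
log_var = np.zeros(8)
z = sample_latent(mu, log_var, rng)
kl_same = kl_diag_gaussians(mu, log_var, mu, log_var)
kl_shift = kl_diag_gaussians(np.array([1.0]), np.zeros(1), np.zeros(1), np.zeros(1))
```

Here q would play the role of the posterior encoder's output and p the prior encoder's output. The divergence is zero when the two distributions coincide, which is the state the training objective pushes toward.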
[0038] As discussed above, the prior encoder 112 computes the features of each object on a per-object basis. Thus, it lacks the ability to model the global spatial arrangement needed for collision-free reconstruction and grasping pose prediction. Therefore, as shown in Figure 1, the machine learning model 100 includes a multi-object encoder. In particular, the multi-object encoder encodes multi-object reasoning to identify relationships between objects in an image. In one example, the multi-object encoder includes a transformer in the latent space, which consists of K standard transformer blocks with self-attention and rotary position embedding (RoPE). The multi-object encoder takes the voxel centers and the features of all objects in the latent space, and the features are updated as follows:
[0039]
[0040] where L denotes the total number of objects. Referring to Figure 2, the multi-object encoder module 230 can implement the multi-object encoder of Figure 1.
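To make the idea of reasoning across objects concrete, the following NumPy sketch runs one single-head self-attention step over the latent tokens of all L objects at once, so that every token can attend to tokens of other objects. It is a deliberately reduced illustration: the rotary position embedding, multiple heads, layer normalization, and feed-forward sublayers of a standard transformer block are omitted, and the function names and weight matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_object_self_attention(latents, Wq, Wk, Wv):
    """latents: list of L arrays, one per object, each of shape (N_i', D')."""
    sizes = [len(l) for l in latents]
    tokens = np.concatenate(latents, axis=0)       # tokens of all objects side by side
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))  # each token attends across all objects
    updated = tokens + attn @ v                    # residual connection
    return np.split(updated, np.cumsum(sizes)[:-1])

rng = np.random.default_rng(0)
latents = [rng.standard_normal((3, 4)), rng.standard_normal((2, 4))]
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
updated = multi_object_self_attention(latents, Wq, Wk, Wv)
```

Concatenating before attention is what lets information flow between objects; splitting afterward restores the per-object layout expected by later stages.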
[0041] Referring back to Figure 1, occlusion between objects in the image can be taken into account by machine learning model 100 using a three-dimensional occlusion field V 122, as disclosed herein. The multi-object encoder discussed above primarily focuses on avoiding collisions between objects and grasping poses in cluttered scenes; because collision modeling only requires local context, it is comparatively simple to handle. In contrast, occlusion modeling requires a comprehensive understanding of the global context to accurately capture visibility relationships, because occluders and occluded objects may be placed far apart. To address this, the 3D occlusion field localizes visibility information to voxels through a simplified octree-based volume rendering.
[0042] In an embodiment, the 3D occlusion field module 232 of Figure 2 can be used to generate the 3D occlusion field V 122 of Figure 1. The 3D occlusion field can encode inter-object occlusion and self-occlusion information via simple ray casting. In particular, the 3D occlusion field module 232 can cast rays from the camera to the voxel centers around the target object and can perform a depth test. This can be seen in Figure 3, in which rays 308, 310, 312, and 314 are cast from camera 301 toward objects 302 and 304. In particular, the voxels in the latent space are subdivided into $B^3$ smaller blocks (B blocks per axis), which are projected into the image space. In the example of Figure 3, an occlusion field is determined for object 302, and object 304 is an occluder of object 302. An occlusion field can also be determined separately for object 304.
[0043] If the ray intersects the target object, that is, if the block lies within the instance mask of the target object, the 3D occlusion field module 232 can set the self-occlusion flag to 1. This is shown in the example of Figure 3 by ray 310, which intersects object 302. If the ray intersects a non-target object, that is, if the block lies within the instance mask of an adjacent object, the 3D occlusion field module 232 can set the inter-object occlusion flag to 1. This is shown in Figure 3 by ray 314, which intersects object 304.
[0044] After computing the flags for all objects in the image, the 3D occlusion field module 232 can construct the 3D occlusion field of the i-th object by concatenating its two flags. The 3D occlusion field module 232 can then encode the 3D occlusion field using a three-layer 3D convolutional neural network (CNN), in which the resolution is halved at each layer, to obtain occlusion features in the latent space. The latent features are then updated with the occlusion features to account for both occlusion and collision.
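The flag computation in paragraphs [0042] to [0044] can be sketched as a projection followed by a depth test. The version below is an interpretation, not the disclosed implementation: it assumes block centers are given in the camera frame, that an integer instance-id image with 0 for background is available, and that a block counts as occluded when it lies behind the observed depth along its ray. All names are hypothetical.

```python
import numpy as np

def occlusion_flags(block_centers, depth, inst_ids, K, target_id):
    """Return an (n, 2) array of [self, inter] occlusion flags per block.

    block_centers: (n, 3) block centers in the camera frame
    depth:         (rows, cols) observed depth map
    inst_ids:      (rows, cols) integer instance ids, 0 = background (assumption)
    """
    n = len(block_centers)
    self_flag = np.zeros(n, dtype=np.uint8)
    inter_flag = np.zeros(n, dtype=np.uint8)
    x, y, z = block_centers.T
    front = z > 0                                   # only blocks in front of the camera
    zs = np.where(front, z, 1.0)                    # avoid division by zero
    u = np.round(K[0, 0] * x / zs + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * y / zs + K[1, 2]).astype(int)
    rows, cols = depth.shape
    inside = front & (u >= 0) & (u < cols) & (v >= 0) & (v < rows)
    ui, vi = u[inside], v[inside]
    behind = z[inside] > depth[vi, ui]              # depth test along the ray
    ids = inst_ids[vi, ui]
    self_flag[inside] = behind & (ids == target_id)
    inter_flag[inside] = behind & (ids != target_id) & (ids != 0)
    return np.stack([self_flag, inter_flag], axis=1)

K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
depth = np.ones((3, 3))
inst_ids = np.zeros((3, 3), dtype=int)
inst_ids[1, 1] = 1                                  # target object
inst_ids[1, 2] = 2                                  # neighboring object
blocks = np.array([[0.0, 0.0, 2.0], [2.0, 0.0, 2.0], [0.0, 0.0, 0.5]])
flags = occlusion_flags(blocks, depth, inst_ids, K, target_id=1)
```

In this toy scene the first block sits behind the target's own surface, the second behind the neighboring object, and the third in front of the observed surface, so it receives neither flag.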
[0045] Referring back to Figure 2, the training module 234 can train machine learning model 100, as disclosed herein. The training module 234 trains the parameters of the machine learning model 100 to minimize a loss function between the predicted octree output by the machine learning model 100 and the target octree y 128 (the ground truth). Specifically, as with a standard variational autoencoder (VAE), the training module 234 trains the machine learning model 100 by maximizing the evidence lower bound (ELBO). The loss function is therefore defined as
[0046]
[0047] where the occupancy term computes the mean binary cross-entropy (BCE) of occupancy at each depth h, and the normal and SDF terms compute the average L2 distance of the surface normals and the SDF, respectively, at the final depth of the octree. The grasp terms compute the average L2 distance for graspness, quality, approach vector, width, and depth, respectively. Due to the symmetry of the gripper, the loss term for the tangent vector computes the average sign-agnostic L2 distance $D_{SA}(a, b) = \min(\lVert a - b \rVert_2, \lVert a + b \rVert_2)$. Finally, the KL term measures the KL divergence between the posterior encoder and the prior encoder. Each $\omega$ is a weight parameter used to align the scales of the different loss terms.
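The sign-agnostic (sign-independent) distance used for the tangent vector, and the weighted combination of loss terms, can be sketched as follows. The distance follows the formula in paragraph [0047]; the dictionary keys and weight values in the usage lines are hypothetical placeholders, since the text does not give the names or magnitudes of the individual terms.

```python
import numpy as np

def sign_agnostic_l2(a, b):
    """Mean over rows of min(||a - b||_2, ||a + b||_2)."""
    d_minus = np.linalg.norm(a - b, axis=-1)
    d_plus = np.linalg.norm(a + b, axis=-1)
    return np.minimum(d_minus, d_plus).mean()

def total_loss(terms, weights):
    """Weighted sum of named loss terms; each weight aligns the scale of its term."""
    return sum(weights[name] * value for name, value in terms.items())

pred = np.array([[1.0, 0.0, 0.0]])
flipped = np.array([[-1.0, 0.0, 0.0]])
d_flipped = sign_agnostic_l2(pred, flipped)
d_orthogonal = sign_agnostic_l2(pred, np.array([[0.0, 1.0, 0.0]]))
# Hypothetical term names and weights, for illustration only.
loss = total_loss({"sdf": 2.0, "kl": 4.0}, {"sdf": 0.5, "kl": 0.25})
```

A tangent prediction that points in exactly the opposite direction incurs zero loss, which reflects the fact that swapping the two fingers of a parallel gripper yields the same grasp.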
[0048] During training, the training module 234 learns the parameters of the posterior encoder, the prior encoder, the decoder, and the multi-object encoder. The learned parameters can be stored in the database 212. After the machine learning model 100 has been trained, the learned parameters can be used to predict the grasping poses of objects in previously unseen images, as discussed in more detail below.
[0049] Referring back to Figure 2, the inference module 236 can be used to perform inference using the machine learning model 100 after it has been trained. Specifically, an image of a scene containing one or more objects and a depth map associated with the image can be received by the image receiving module 214. The inference module 236 can then input the image and depth map into the trained machine learning model 100. During inference, the posterior encoder is not used, because this component is only needed while training the machine learning model 100. The decoder can output an octree indicating predictions of the grasping poses, normals, and SDF of the objects in the image. As discussed above, a grasping pose can indicate how each object in the scene can be grasped. Therefore, a gripper (e.g., a robotic arm) can then use the predicted grasping poses to grasp and manipulate one or more objects in the scene.
[0050] This allows the gripper to grasp objects in the scene. However, accurate contact is desirable for successful grasping, as it ensures stability and control during manipulation. While the machine learning model 100 predicts the width and depth of the gripper, even small errors can lead to unstable grasping. Therefore, in this embodiment, the grasping pose refinement module 238 of Figure 2 can refine the grasping pose predicted by the machine learning model 100, as disclosed herein.
[0051] Figure 4 shows an exemplary gripper 400 with a left finger 402 and a right finger 404. In an embodiment, the grasping pose refinement module 238 can adjust the positions of the gripper's fingertips to align with the contact points $c_L$ and $c_R$ on the reconstruction that are nearest to the left finger and the right finger, respectively.
[0052] Based on the contact points, the width w is refined as
[0053] $$\Delta w = \min(D(c_L), D(c_R)) \tag{9}$$
[0054] $$w \leftarrow w + 2\left(\max(\gamma_{\min}, \min(\Delta w, \gamma_{\max})) - \Delta w\right) \tag{10}$$
[0055] which keeps the contact distance $\Delta w$ within the range $\gamma_{\min}$ to $\gamma_{\max}$. Here, $D(c)$ denotes the contact distance to the contact point $c$. The grasping pose refinement module 238 can further adjust the depth d as
[0056] $$d \leftarrow \max(Z(c_L), Z(c_R)) \tag{11}$$
[0057] where $Z(c)$ computes the depth of the contact point $c$. An example of this refinement is shown in Figure 4, in which the initial grasping pose 406 is refined into the final grasping pose 408. These refinement steps help ensure a stable grasp.
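Equations (9) to (11) amount to clamping the contact distance and taking the deeper of the two contact points. A minimal Python sketch is given below; the function name, argument names, and the numeric values in the usage lines (in metres) are hypothetical, and the contact distances and depths are assumed to have been measured beforehand on the reconstruction.

```python
def refine_width_depth(width, d_left, d_right, z_left, z_right, gamma_min, gamma_max):
    """Refine gripper width and depth from the two nearest contact points.

    d_left, d_right: contact distances D(c_L) and D(c_R)
    z_left, z_right: contact point depths Z(c_L) and Z(c_R)
    """
    delta_w = min(d_left, d_right)                        # Eq. (9)
    clamped = max(gamma_min, min(delta_w, gamma_max))     # clamp to [gamma_min, gamma_max]
    new_width = width + 2.0 * (clamped - delta_w)         # Eq. (10)
    new_depth = max(z_left, z_right)                      # Eq. (11)
    return new_width, new_depth

# The closer finger is only 1 mm from the surface, below the 2 mm minimum,
# so the gripper is opened by 2 * 1 mm.
w_new, d_new = refine_width_depth(
    width=0.08, d_left=0.001, d_right=0.004,
    z_left=0.02, z_right=0.03, gamma_min=0.002, gamma_max=0.01,
)
```

The factor of 2 appears because the fingers move symmetrically: changing the clearance on each side by a given amount changes the total opening by twice that amount.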
[0058] Additionally, the grasping pose refinement module 238 can perform collision detection to identify predicted grasping poses that would lead to collisions with occluded areas. In particular, based on the reconstructed shape of the objects in the image, the grasping pose refinement module 238 can implement a model-free collision detector using a two-finger parallel gripper (e.g., the two-finger parallel gripper 400 of Figure 4). The grasping pose refinement module 238 can then discard any predicted grasping pose that would lead to a collision with the occluded area.
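A collision check of this kind can be sketched by testing whether any reconstructed point falls inside the volume swept by either finger. The geometry below is an assumption made only for illustration: each finger is modeled as an axis-aligned box in a gripper frame whose x axis is the approach direction and whose y axis is the closing direction, and the finger thickness and height are hypothetical defaults.

```python
import numpy as np

def grasp_collides(points, R, t, width, depth,
                   finger_thickness=0.01, finger_height=0.02):
    """Return True if any reconstructed point lies inside a finger box.

    points: (n, 3) reconstructed surface points in the camera frame
    R, t:   gripper rotation (3, 3) and position (3,) in the camera frame
    Assumed gripper frame: x = approach, y = closing direction, z = finger height.
    """
    local = (np.asarray(points, dtype=float) - t) @ R    # points in the gripper frame
    x, y, z = local[:, 0], local[:, 1], local[:, 2]
    along_finger = (x >= 0.0) & (x <= depth)
    within_height = np.abs(z) <= finger_height / 2.0
    half = width / 2.0
    in_finger_band = (np.abs(y) >= half) & (np.abs(y) <= half + finger_thickness)
    return bool(np.any(along_finger & within_height & in_finger_band))

def filter_grasps(points, grasps):
    """Keep only the grasps (dicts of R, t, width, depth) that do not collide."""
    return [g for g in grasps if not grasp_collides(points, **g)]

pts = np.array([[0.02, 0.045, 0.0]])
narrow = dict(R=np.eye(3), t=np.zeros(3), width=0.08, depth=0.04)
wide = dict(R=np.eye(3), t=np.zeros(3), width=0.10, depth=0.04)
kept = filter_grasps(pts, [narrow, wide])
```

With the narrower opening the point lies inside a finger and the grasp is discarded; with the wider opening the same point sits between the fingers and the grasp is kept.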
[0059] Figure 5 depicts a flowchart of an exemplary method for operating computing device 200 to train machine learning model 100, as disclosed herein. At step 500, the training data receiving module 216 receives training data. As discussed above, the training data may include multiple images containing one or more objects, multiple depth maps associated with the multiple images, and ground-truth data including shapes and grasping poses associated with the one or more objects in the multiple images. In particular, the ground-truth data may include octree data including a grasping pose for each object in the image, a normal for each object in the image, and an SDF for each object in the image. At step 502, the training module 234 may train machine learning model 100 based on the received training data using the techniques discussed above. In particular, the training module 234 may use the training data to train machine learning model 100 to receive a first image containing one or more first objects and a first depth map associated with the first image, and to output a first shape and a first grasping pose for the one or more first objects.
[0060] Figure 6 depicts a flowchart of an exemplary method for operating computing device 200 after machine learning model 100 has been trained. At step 600, the image receiving module 214 receives a second image of a scene containing one or more second objects and a second depth map associated with the second image. At step 602, the octree conversion module 223 generates an octree based on the second image and the second depth map, as discussed above. At step 604, the inference module 236 inputs the octree into the trained machine learning model 100. At step 606, the inference module 236 predicts a grasping pose for the one or more second objects in the scene based on the output of the trained machine learning model 100. At step 608, the grasping pose refinement module 238 refines the predicted grasping pose using the techniques described above. In some examples, computing device 200 may cause a gripper to grasp and manipulate the one or more second objects based on the refined grasping pose.
[0061] It should now be understood that the embodiments described herein relate to a method and system for robotic grasping enabled by zero-shot shape reconstruction. Using the techniques described herein, machine learning models can be trained to accurately predict 3D reconstructions of objects and grasping poses from previously unseen images. Using an octree as the shape representation enables an efficient depth-first search, which makes the approach well suited to high-resolution shape reconstruction and dense grasping pose prediction in a memory-efficient and computationally efficient manner. A multi-object encoder models relationships between objects via a 3D transformer in latent space, thereby achieving collision-free 3D reconstruction and grasping pose prediction. A 3D occlusion field captures self-occlusion and inter-object occlusion to enhance shape reconstruction in occluded regions.
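The 3D occlusion field can be illustrated as follows. The claims describe casting rays from the camera to voxel centers around a target object and setting a self-occlusion flag or an inter-object occlusion flag depending on what the ray hits. The sketch below is a simplified assumption rather than the disclosed implementation: occupancy is stored as plain sets of integer voxel indices, intersection is tested by fixed-step ray marching, and the names `voxel_index`, `occlusion_flags`, and `step_ratio` are hypothetical.

```python
import math


def voxel_index(point, voxel_size):
    """Integer index of the voxel that contains a 3D point."""
    return tuple(math.floor(c / voxel_size) for c in point)


def occlusion_flags(camera, voxel_center, target_voxels, other_voxels,
                    voxel_size, step_ratio=0.25):
    """Return (self_occlusion, inter_object_occlusion) for one query voxel.

    target_voxels / other_voxels: sets of occupied voxel indices belonging
    to the target object and to all non-target objects, respectively.
    """
    direction = [v - c for v, c in zip(voxel_center, camera)]
    length = math.sqrt(sum(d * d for d in direction))
    query = voxel_index(voxel_center, voxel_size)
    # Sample the ray several times per voxel so thin occupancy is not skipped.
    n_steps = max(1, int(length / (voxel_size * step_ratio)))
    self_flag, inter_flag = 0, 0
    for i in range(1, n_steps):
        t = i / n_steps
        sample = [c + t * d for c, d in zip(camera, direction)]
        idx = voxel_index(sample, voxel_size)
        if idx == query:
            continue  # the query voxel does not occlude itself
        if idx in target_voxels:
            self_flag = 1   # ray passes through the target object
        if idx in other_voxels:
            inter_flag = 1  # ray passes through a non-target object
    return self_flag, inter_flag
```

Evaluating `occlusion_flags` for every voxel center around a target object yields two binary channels per voxel. Fixed-step marching is chosen here only for brevity; an exact voxel traversal would avoid missed or repeated cells.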
[0062] Note that the terms “substantially” and “about” may be used herein to represent the inherent degree of uncertainty that may be attributed to any quantitative comparison, value, measurement, or other representation. These terms are also used herein to represent the degree by which a quantitative representation may vary from a stated reference without resulting in a change in the basic function of the subject matter at issue.
[0063] While specific embodiments have been illustrated and described herein, it should be understood that various other changes and modifications may be made without departing from the spirit and scope of the claimed subject matter. Furthermore, although various aspects of the claimed subject matter have been described herein, such aspects need not be combined. Therefore, the appended claims are intended to cover all such changes and modifications within the scope of the claimed subject matter.
Claims
1. A method comprising: receiving training data, the training data including a plurality of images containing one or more objects, a plurality of depth maps associated with the plurality of images, and ground truth data associated with the plurality of images, the ground truth data including object shapes and grasping poses associated with the one or more objects in the plurality of images; and training, using the training data, a machine learning model to receive a first image containing one or more first objects and a first depth map associated with the first image, and to output a first shape of the one or more first objects and a first grasping pose of the one or more first objects, wherein the machine learning model includes: a conditional variational autoencoder; a multi-object encoder configured to encode multi-object reasoning associated with the objects; and a three-dimensional occlusion field determined by ray casting.
2. The method according to claim 1, further comprising: determining image features associated with the plurality of images; converting the image features into an octree; and inputting the octree into the machine learning model during training of the machine learning model.
3. The method according to claim 2, further comprising: identifying the one or more objects in the plurality of images; generating two-dimensional instance masks for the one or more objects in the plurality of images; and back-projecting the image features into three-dimensional space based on the two-dimensional instance mask and the instance mask.
4. The method according to claim 2, wherein the conditional variational autoencoder includes: a first encoder configured to receive the ground truth data and output a latent code; a second encoder configured to receive the octree as input and output latent features; and a decoder configured to predict a three-dimensional reconstruction of the object shapes and the grasping poses.
5. The method according to claim 1, wherein the multi-object encoder is configured to encode the multi-object reasoning to avoid collisions between the one or more objects in the plurality of images.
6. The method according to claim 1, further comprising determining the three-dimensional occlusion field by: casting rays from a camera to voxel centers surrounding a target object among the one or more objects in the plurality of images; if a ray intersects the target object, setting a self-occlusion flag to 1; and if the ray intersects a non-target object, setting an inter-object occlusion flag to 1.
7. The method according to claim 1, wherein the grasping pose includes a graspness, a quality, an approach vector, a tangential vector, a width, and a depth.
8. The method according to claim 1, further comprising: inputting a second image containing one or more second objects and a second depth map associated with the second image into the trained machine learning model; and determining a second grasping pose associated with the one or more second objects based on an output of the trained machine learning model.
9. The method according to claim 8, further comprising: refining the second grasping pose by adjusting a fingertip position of a gripper to align with a nearest contact point on a reconstruction of the one or more second objects.
10. A computing device comprising one or more processors, the one or more processors being configured to: receive training data, the training data including a plurality of images containing one or more objects, a plurality of depth maps associated with the plurality of images, and ground truth data associated with the plurality of images, the ground truth data including object shapes and grasping poses associated with the one or more objects in the plurality of images; and train, using the training data, a machine learning model to receive a first image containing one or more first objects and a first depth map associated with the first image, and to output a first shape of the one or more first objects and a first grasping pose of the one or more first objects, wherein the machine learning model includes: a conditional variational autoencoder; a multi-object encoder configured to encode multi-object reasoning associated with the objects; and a three-dimensional occlusion field determined by ray casting.
11. The computing device according to claim 10, wherein the one or more processors are further configured to: determine image features associated with the plurality of images; convert the image features into an octree; and input the octree into the machine learning model during training of the machine learning model.
12. The computing device according to claim 11, wherein the one or more processors are further configured to: identify the one or more objects in the plurality of images; generate two-dimensional instance masks for the one or more objects in the plurality of images; and back-project the image features into three-dimensional space based on the two-dimensional instance mask and the instance mask.
13. The computing device according to claim 12, wherein the conditional variational autoencoder includes: a first encoder configured to receive the ground truth data and output a latent code; a second encoder configured to receive the octree as input and output latent features; and a decoder configured to predict a three-dimensional reconstruction of the object shapes and the grasping poses.
14. The computing device according to claim 10, wherein the multi-object encoder is configured to encode the multi-object reasoning to avoid collisions between the one or more objects in the plurality of images.
15. The computing device according to claim 10, wherein the one or more processors are further configured to determine the three-dimensional occlusion field by: casting rays from a camera to voxel centers surrounding a target object among the one or more objects in the plurality of images; if a ray intersects the target object, setting a self-occlusion flag to 1; and if the ray intersects a non-target object, setting an inter-object occlusion flag to 1.
16. The computing device according to claim 10, wherein the grasping pose includes a graspness, a quality, an approach vector, a tangential vector, a width, and a depth.
17. The computing device according to claim 10, wherein the one or more processors are further configured to: input a second image containing one or more second objects and a second depth map associated with the second image into the trained machine learning model; and determine a second grasping pose associated with the one or more second objects based on an output of the trained machine learning model.
18. The computing device according to claim 17, wherein the one or more processors are further configured to refine the second grasping pose by adjusting a fingertip position of a gripper to align with a nearest contact point on a reconstruction of the one or more second objects.
19. A non-transitory computer-readable storage medium comprising a memory storing a program, the program, when executed by a processor, causing the processor to: receive training data, the training data including a plurality of images containing one or more objects, a plurality of depth maps associated with the plurality of images, and ground truth data associated with the plurality of images, the ground truth data including object shapes and grasping poses associated with the one or more objects in the plurality of images; and train, using the training data, a machine learning model to receive a first image containing one or more first objects and a first depth map associated with the first image, and to output a first shape of the one or more first objects and a grasping pose of the one or more first objects, wherein the machine learning model includes: a conditional variational autoencoder; a multi-object encoder configured to encode multi-object reasoning associated with the objects; and a three-dimensional occlusion field determined by ray casting.
20. The non-transitory computer-readable storage medium according to claim 19, wherein the program further causes the processor to: input a second image containing one or more second objects and a second depth map associated with the second image into the trained machine learning model; and determine a second grasping pose associated with the one or more second objects based on an output of the trained machine learning model.