A grasping and placing method and device based on image editing and multimodal thought chain reasoning.
By employing image editing and multimodal thinking chain reasoning-based grasping and placement methods, the problem of precise robot operation in complex scenarios was solved, enabling accurate grasping and placement of target objects and improving the system's adaptability and operational accuracy in dynamic environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- FUDAN UNIVERSITY
- Filing Date
- 2024-12-30
- Publication Date
- 2026-06-30
AI Technical Summary
Existing robots lack the ability to perform precise operations in complex scenarios, especially when handling transparent or reflective objects or objects with complex geometric shapes. They also consume a lot of computational resources, and the applicability of imitation learning methods in dynamic scenarios is limited.
A grasping and placement method based on image editing and multimodal thinking chain reasoning is adopted. The target object is segmented, adjusted and background is blended through an image editing model. The scene is reconstructed by combining a 3D Gaussian model. Operation steps are generated by multimodal thinking chain reasoning. The grasping device is controlled by behavior tree to perform precise grasping and placement.
It achieves precise grasping and placement of target objects in complex scenarios, reduces geometric errors, improves the system's adaptability and operational robustness in dynamic environments, and enhances the efficiency and accuracy of task planning.
Smart Images

Figure CN122299604A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of robot automated operation technology, specifically relating to a grasping and placement method and device based on image editing and multimodal thought chain reasoning. Background Technology
[0002] In modern robotics, sophisticated environmental perception and task planning are crucial for robots to perform complex operations. However, with increasing task complexity and scene diversity, existing technologies still exhibit significant limitations in handling precise manipulation in complex environments. Traditional robot perception and manipulation largely rely on methods such as imitation learning and pose estimation. While these methods perform well in simple or fixed scenarios, they often face challenges such as low accuracy, high computational resource consumption, and poor adaptability to dynamic environmental changes when dealing with more complex scene variations.
[0003] Traditional methods for 6-DOF pose estimation and object recognition typically rely on high-precision sensors and complex environment modeling, making them ineffective in handling transparent or reflective objects, or objects with complex geometries. Especially when robots need to perform multi-step operations in complex environments, existing robotic systems based on imitation learning and sequential decision-making, while capable of basic grasping and placement tasks, show significant limitations in performance and flexibility when dealing with changes in object posture and position, as well as interference between multiple objects. Furthermore, the high cost and heavy reliance on demonstration quality of imitation learning also restricts its generalizability in dynamic and changing scenarios.
[0004] Therefore, there is still considerable room for improvement in the current perception and manipulation capabilities of robots. Summary of the Invention
[0005] This invention is made to solve the above-mentioned problems, and aims to provide a grasping and placement method and apparatus based on image editing and multimodal thought chain reasoning.
[0006] This invention provides a grasping and placement method based on image editing and multimodal thinking chain reasoning, used to control a grasping device to grasp a target object and place it at a designated location. It includes the following steps: Step S1, acquiring an RGB sequence of the target object with camera pose, and constructing a 3D Gaussian model of the scene based on the RGB sequence; Step S2, selecting multiple multi-view images from the RGB sequence as pre-editing reference view images; Step S3, editing the target object in each pre-editing reference view image, setting the target object at a designated location to obtain the corresponding edited image, and updating the 3D Gaussian model based on all edited images; Step S4, obtaining post-editing reference visual images corresponding to each pre-editing reference view image based on the 3D Gaussian model; Step S5, inputting the pre-editing reference view images and post-editing reference visual images into a multimodal model, and generating multiple operation steps through multimodal thinking chain reasoning; Step S6, controlling the grasping device to grasp the target object and place it at the designated location based on all operation steps, the pre-editing reference view images, and the post-editing reference visual images.
[0007] The grasping and placement method based on image editing and multimodal thought chain reasoning provided by this invention may also have the following features: In step S3, the reference view image before editing is input into the image editing model to obtain the corresponding edited image. The process of the image editing model editing the reference view image before editing includes the following steps: Step T1, segmenting the reference view image before editing and identifying the target object; Step T2, moving the target object to a specified position and filling in the original position of the target object; Step T3, obtaining scene depth information according to the MiDaS monocular depth estimation algorithm and determining whether the target object is missing. If so, the target object is filled in and step T4 is executed; otherwise, step T4 is executed; Step T4, adjusting the size of the target object according to the specified position and scene depth information; Step T5, performing background fusion on the target object at the specified position using a local fusion algorithm to obtain the edited image.
[0008] The grasping and placement method based on image editing and multimodal thinking chain reasoning provided by the present invention may also have the following features: In step T1, the target object is selected by the user, and the reference view image before editing is segmented by the HQ-SAM model to identify the target object. When the boundary of the target object is blurred, the ViTMatte matting algorithm is used to accurately segment it to obtain the clear boundary of the target object.
[0009] The grasping and placement method based on image editing and multimodal thinking chain reasoning provided by the present invention may also have the following feature: wherein, in step T4, the size of the target object is adjusted so that the perspective relationship of the target object at the specified position is consistent with that of the target object at the original position.
[0010] The grasping and placement method based on image editing and multimodal thinking chain reasoning provided by the present invention may also have the following features: in step T2, the original position is naturally blended with the surrounding background by filling; in step T5, the background blending includes lighting, edge processing and shadow generation of the target object at the specified position.
[0011] The grasping and placement method based on image editing and multimodal thought chain reasoning provided by this invention may also have the following features: Step S6 includes the following sub-steps: Step S6-1, converting all operation steps into a behavior tree; Step S6-2, generating 6D pose data and 3D size data based on the reference view image before editing, the reference visual image after editing, and the object pose estimation model; Step S6-3, controlling the grasping device to grasp the target object and place it at the specified position based on the task planning in the behavior tree, combined with the 6D pose data and 3D size data.
[0012] The grasping and placement method based on image editing and multimodal thinking chain reasoning provided by the present invention may also have the following features: in step S6-2, the reference view image before editing and the reference visual image after editing are input into the target detection model to generate a feature containing the detection box and label of the target object, and the feature is input into the object pose estimation model to generate 6D pose data and 3D size data.
[0013] The grasping and placement method based on image editing and multimodal thinking chain reasoning provided by the present invention may also have the following feature: in step S6-3, after the target object is located at the specified position, the change in scene state is synchronously updated to the behavior tree.
[0014] This invention also provides a grasping and placing device based on image editing and multimodal thought chain reasoning, used to grasp a target object and place it at a designated location. It features the following components: a grasping module for grasping and moving the target object; a 3D model building module for acquiring an RGB sequence of the target object with camera pose and constructing a 3D Gaussian model of the scene based on the RGB sequence; a pre-editing reference view image generation module for obtaining multiple pre-editing reference view images of the target object from different perspectives based on the 3D Gaussian model; and a 3D model updating module for updating the target object in each pre-editing reference view image. The system comprises several modules: a line editing module, which sets the target object in a specified position to obtain the corresponding edited image and updates the 3D Gaussian model based on all edited images; a post-editing reference visual image generation module, which generates post-editing reference visual images corresponding to each pre-editing reference view image based on the 3D Gaussian model; an operation step generation module, which includes a multimodal model, which takes the pre-editing reference view image and the post-editing reference visual image as inputs into the multimodal model and generates multiple operation steps through multimodal reasoning; and a grasping control module, which controls the grasping module to grasp the target object and place it in a specified position based on all operation steps, the pre-editing reference view image, and the post-editing reference visual image.
[0015] The role and effect of invention
[0016] According to the grasping and placement method and apparatus based on image editing and multimodal thought chain reasoning of the present invention, on the one hand, the simulation and consistency maintenance of the target object's motion are achieved through image editing and 3D Gaussian model reconstruction; on the other hand, through multimodal thought chain reasoning and behavior tree generation technology, a task planning sequence is automatically generated and converted into an executable behavior tree based on the scenes before and after image editing. Therefore, the grasping and placement method and apparatus based on image editing and multimodal thought chain reasoning of the present invention can achieve accurate grasping and placement of target objects in complex scenes. Attached Figure Description
[0017] Figure 1 This is a block diagram of the grasping and placing device in an embodiment of the present invention;
[0018] Figure 2 This is a schematic diagram illustrating the process of editing the reference view image before editing the image editing model in an embodiment of the present invention;
[0019] Figure 3 This is a schematic diagram of the blurred and clear boundaries of the same target object in an embodiment of the present invention;
[0020] Figure 4 This is a schematic diagram of the target object before and after movement in an embodiment of the present invention;
[0021] Figure 5 This is a schematic diagram of the image editing model pre-training process in an embodiment of the present invention;
[0022] Figure 6 This is a flowchart of an embodiment of the present invention showing how the grasping control module controls the grasping module to grasp the target object and place it at a specified position;
[0023] Figure 7 This is a flowchart illustrating the grasping and placement method based on image editing and multimodal thought chain reasoning in an embodiment of the present invention. Detailed Implementation
[0024] To make the technical means, creative features, objectives and effects of this invention easy to understand, the following embodiments, in conjunction with the accompanying drawings, provide a detailed description of the grasping and placing method and apparatus based on image editing and multimodal thinking chain reasoning.
[0025] This embodiment provides a grasping and placing device based on image editing and multimodal thought chain reasoning, hereinafter referred to as the grasping and placing device, used to grasp target objects and place them at designated locations. In this embodiment, there are multiple different target objects in the same scene, each corresponding to a different designated location. The grasping and placing device based on image editing and multimodal thought chain reasoning can grasp each target object sequentially, ultimately placing all target objects in their corresponding designated locations.
[0026] Figure 1 This is a block diagram of the gripping and placing device in an embodiment of the present invention.
[0027] like Figure 1 As shown, the gripping and placing device 100 includes a gripping module 11, a 3D model building module 12, a pre-editing reference view image generation module 13, a 3D model updating module 14, a post-editing reference view image generation module 15, an operation step generation module 16, a gripping control module 17, and a central control module 18.
[0028] The grasping module 11 is used to grasp the target object and move the target object.
[0029] The 3D model building module 12 is used to obtain the RGB sequence of the target object with camera pose, and to build a 3D Gaussian model of the scene, i.e., a 3D Gaussian point cloud, based on the RGB sequence.
[0030] The pre-editing reference viewpoint image generation module 13 is used to select multiple multi-view images from the RGB sequence as pre-editing reference viewpoint images.
[0031] In this embodiment, all selected images from different perspectives include at least one image of the target object from a frontal view.
[0032] The 3D model update module 14 is used to edit the target object in each pre-edit reference view image, set the target object in a specified position, obtain the corresponding edited image, and update the 3D Gaussian model based on all edited images.
[0033] The 3D model update module 14 includes an image editing model. The 3D model update module 14 inputs the reference view image before editing into the image editing model to obtain the corresponding edited image. In this embodiment, the image editing model is a diffusion model based on StableDiffusion.
[0034] Figure 2 This is a schematic diagram illustrating the process of editing a reference view image before editing an image editing model in an embodiment of the present invention.
[0035] like Figure 2 As shown, the process of editing a reference view image before editing using an image editing model includes the following steps:
[0036] Step T1: Segment the reference view image before editing and identify the target object.
[0037] In step T1, the target object selected by the user is identified by segmenting the reference view image before editing using the HQ-SAM model.
[0038] In this embodiment, the HQ-SAM model can extract the boundary of the target object from the reference view image before editing. However, when the target object has a complex shape, its edges are similar in color or texture to the background, or it has complex surfaces such as semi-transparency or reflection, a blurred boundary will be generated, which includes the part outside the target object. To address this, this embodiment uses the ViTMatte matting algorithm for precise segmentation to obtain a clear boundary of the target object.
[0039] Figure 3 This is a schematic diagram of the blurred and clear boundaries of the same target object in an embodiment of the present invention.
[0040] like Figure 3 As shown, (a) is a schematic diagram of the blurred boundary of the target object, and (b) is a schematic diagram of the sharp boundary of the target object. In (a) and (b), the gray area within the white rectangle represents the extracted boundary of the target object, which is a bicycle. It is evident that the ViTMatte matting algorithm, compared to using only the HQ-SAM model, can obtain a clearer and more accurate boundary of the target object.
[0041] Step T2: Move the target object to the designated position and fill in the original position of the target object.
[0042] In step T2, the original position is filled in to blend naturally with the surrounding background.
[0043] Step T3: Obtain scene depth information based on the MiDaS monocular depth estimation algorithm and determine whether the target object is missing. If so, complete the target object and execute step T4; otherwise, execute step T4.
[0044] In this embodiment, task prompts such as removal and completion guide the diffusion model to complete the semantic generation task for the missing parts of the target object, thereby processing the missing parts. Furthermore, this embodiment completes the missing parts of the target object using an image containing a frontal view of the target object.
[0045] Figure 4 This is a schematic diagram of the target object before and after movement in an embodiment of the present invention.
[0046] like Figure 4 As shown, (a) is a schematic diagram of the target object before it is moved, and (b) is a schematic diagram of the target object after it is moved. The white rectangle in (a) and (b) represents the target object, which is a duck. It can be seen that in the image before the target object is moved, only part of the target object is displayed, i.e., there are missing parts. After the target object is moved, it is filled in, and a complete target object is generated, which can clearly show the shape characteristics of the whole duck.
[0047] Step T4: Adjust the size of the target object based on the specified location and scene depth information.
[0048] In step T4, the size of the target object is adjusted so that the perspective relationship of the target object at the specified position is consistent with that of the target object at the original position.
[0049] Step T5: Use a local fusion algorithm to perform background fusion on the target object at the specified location to obtain the edited image.
[0050] In step T5, background blending includes illuminating, edge processing, and shadow generation of the target object at the specified location, thereby ensuring that the target object blends seamlessly at the specified location and enhancing realism.
[0051] Figure 5 This is a schematic diagram of the image editing model pre-training process in an embodiment of the present invention.
[0052] like Figure 5 As shown, the pre-training process of an image editing model includes the following steps:
[0053] Step U1: Use the existing COCO dataset and iHarmony4 dataset as training datasets, and construct a diffusion model based on the existing Stable Diffusion v2.0 model. In this embodiment, the COCO dataset provides the target segmentation mask, the iHarmony4 dataset provides the discordant image pairs required for the local fusion task, and the Stable Diffusion v2.0 model serves as the base model, loaded with corresponding pre-trained weights.
[0054] Step U2: Based on the training dataset and using task reversal techniques, the diffusion model is trained for target removal and target completion respectively. The parameters of U-Net in the diffusion model are optimized using the AdamW optimizer, and the weights are adjusted to better generate and fuse the target object with the background.
[0055] In this embodiment, for the "target removal" task, a mask with the same shape as the target object is generated, and these masks are used to mark the object's position in the image. The model learns how to generate a seamless background area by filling the area where the target was removed with background content. For the "target completion" task, a mask of the partially occluded object is generated. During the learning process, the model uses these masks to fully reconstruct the object, ensuring that the completed object maintains consistency and integrity in its new position.
[0056] Step U3: Repeat step U2 until the denoising objective function loss value converges, then the trained diffusion model is obtained as the image editing model.
[0057] The edited reference visual image generation module 15 is used to obtain the edited reference visual image based on the 3D Gaussian model.
[0058] The operation step generation module 16 includes a multimodal model, which is used to input the pre-edit reference view image and the post-edit reference visual image into the multimodal model, and generate multiple operation steps through multimodal thought chain reasoning. In this embodiment, the multimodal model is a pre-trained multimodal model GPT-4o.
[0059] The grasping control module 17 is used to control the grasping module 11 to grasp the target object and place it at a specified position based on all operation steps, the reference view image before editing, and the reference visual image after editing. In this embodiment, the grasping control module 17 includes a target detection model and an object pose estimation model. The target detection model is the Grounded-Light-HQSAM model, and the object pose estimation model is the MVPoseNet6D model.
[0060] Figure 6 This is a flowchart of an embodiment of the present invention showing how the grasping control module controls the grasping module to grasp the target object and place it at a specified position.
[0061] like Figure 6 As shown, the process by which the grasping control module 17 controls the grasping module 11 to grasp the target object and place it at a designated position includes the following steps:
[0062] Step S6-1: Convert all operation steps into a behavior tree.
[0063] Step S6-2: Generate 6D pose data and 3D dimension data based on the reference view image before editing, the reference visual image after editing, and the object pose estimation model.
[0064] In step S6-2, the reference view image before editing and the reference visual image after editing are input into the target detection model to generate features containing the detection box and label of the target object. These features are then input into the object pose estimation model to generate 6D pose data and 3D dimension data.
[0065] In this embodiment, both the reference view image before editing and the reference visual image after editing are RGB images and depth images, i.e., RGB-D sequences.
[0066] In this embodiment, the 3D size data plays an auxiliary role in grasping the target object. Specifically, the 3D size data is used by the robotic arm of the grasping module 11 to control the opening distance of the gripper when grasping the target object. If the opening distance exceeds the preset maximum distance, an error is reported.
[0067] Step S6-3: Based on the task planning in the behavior tree, and combining 6D pose data and 3D size data, control the grasping module 11 to grasp the target object and place it at the specified position.
[0068] In step S6-3, after the target object is located at the specified position, the change in the scene state is synchronously updated to the behavior tree.
[0069] The main control module 18 stores the control program that controls the operation of the above modules.
[0070] The following description, in conjunction with the accompanying drawings, explains the process of using the grasping and placing device 100 to perform a grasping and placing method based on image editing and multimodal thought chain reasoning.
[0071] Figure 7 This is a flowchart illustrating the grasping and placement method based on image editing and multimodal thought chain reasoning in an embodiment of the present invention.
[0072] like Figure 7 As shown, the grasping and placement method based on image editing and multimodal thought chain reasoning includes the following steps:
[0073] Step S1: Use the 3D model building module 12 to obtain the RGB sequence of the target object with camera pose, and build a 3D Gaussian model of the scene based on the RGB sequence.
[0074] Step S2: The pre-editing reference viewpoint image generation module 13 selects multiple multi-view images from the RGB sequence as pre-editing reference viewpoint images.
[0075] Step S3: The 3D model update module 14 is used to edit the target objects in each pre-edit reference view image, set the target objects in the specified positions, obtain the corresponding edited images, and update the 3D Gaussian model based on all edited images.
[0076] Step S4: The edited reference visual image is obtained by the edited reference visual image generation module 15 based on the 3D Gaussian model.
[0077] Step S5: The operation step generation module 16 inputs the reference view image before editing and the reference visual image after editing into the multimodal model, and generates multiple operation steps through multimodal thinking chain reasoning.
[0078] Step S6: The grasping control module 17 controls the grasping device, i.e. the grasping module 11, to grasp the target object and place it at the designated position based on all operation steps, the reference view image before editing, and the reference visual image after editing.
[0079] The role and effect of the embodiments
[0080] According to the grasping and placement method and apparatus based on image editing and multimodal thought chain reasoning involved in this embodiment, on the one hand, by using image editing and 3D Gaussian model reconstruction, the simulation and consistency of target object motion are achieved, effectively reducing geometric errors and inconsistencies, especially when dealing with complex object shapes and reflective surfaces, exhibiting higher accuracy and stability; on the other hand, by using multimodal thought chain reasoning and behavior tree generation technology, based on the scenes before and after image editing, a task planning sequence is automatically generated and converted into an executable behavior tree, which not only significantly improves the efficiency of task planning but also enhances the system's adaptability in complex scenes. In summary, this method can achieve accurate target object grasping and placement in complex scenes.
[0081] Furthermore, the multimodal model is a pre-trained model, enabling efficient manipulation in variable environments and complex tasks. Therefore, it can perform automated reasoning by combining real-world and generated scenes, dynamically update operational strategies, and maintain precise manipulation of target objects without frequent scene reconstruction. This significantly improves operational robustness in dynamic environments.
[0082] Those skilled in the art should understand that this invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to this invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the invention as claimed. The scope of protection of this invention is defined by the appended claims and their equivalents.
Claims
1. A method for image editing and multi-modal thought chain reasoning-based pick and place, for controlling a picking device to pick a target object and place it at a specified location, characterized in that, Includes the following steps: Step S1: Obtain the RGB sequence of the target object with camera pose, and construct a 3D Gaussian model of the scene based on the RGB sequence; Step S2: Select multiple multi-view images from the RGB sequence as reference view images before editing; Step S3: Edit the target object in each of the pre-edit reference view images, set the target object in the specified position to obtain the corresponding edited image, and update the 3D Gaussian model according to all the edited images; Step S4: Obtain the post-edit reference visual image corresponding to each of the pre-edit reference viewpoint images based on the 3D Gaussian model; Step S5: Input the reference view image before editing and the reference visual image after editing into the multimodal model, and generate multiple operation steps through multimodal thinking chain reasoning; Step S6: Based on all the operation steps, the pre-editing reference view image, and the post-editing reference visual image, control the gripping device to grip the target object and place it at the designated position.
2. The image editing and multi-modal thought chain inference based pick and place method of claim 1, Its features are: In step S3, the reference viewpoint image before editing is input into the image editing model to obtain the corresponding edited image. The process of the image editing model editing the reference view image before editing includes the following steps: Step T1: Segment the reference view image before editing and identify the target object; Step T2: Move the target object to the designated position and fill in the original position of the target object; Step T3: Obtain scene depth information according to the MiDaS monocular depth estimation algorithm, and determine whether the target object is missing. If so, complete the target object and execute step T4; otherwise, execute step T4. Step T4: Adjust the size of the target object according to the specified location and the scene depth information; Step T5: The target object at the specified location is subjected to background fusion using a local fusion algorithm to obtain the edited image.
3. The grasping and placing method based on image editing and multimodal thought chain reasoning according to claim 2, characterized in that: wherein In step T1, based on the user's selection of the target object, the target object is identified by segmenting the pre-editing reference view image using the HQ-SAM model. When the boundary of the target object is blurred, the ViTMatte matting algorithm is used for precise segmentation to obtain a clear boundary of the target object.
4. The grasping and placing method based on image editing and multimodal thought chain reasoning according to claim 2, characterized in that: wherein In step T4, the size of the target object is adjusted so that the perspective relationship of the target object at the specified position is consistent with that of the target object at the original position.
5. The grasping and placing method based on image editing and multimodal thought chain reasoning according to claim 2, characterized in that: wherein, In step T2, the original position is filled in to allow it to blend naturally with the surrounding background. In step T5, the background blending includes illuminating, edge processing, and shadow generation of the target object at the specified location.
6. The grasping and placing method based on image editing and multimodal thought chain reasoning according to claim 1, characterized in that: wherein Step S6 includes the following sub-steps: Step S6-1: Convert all the operation steps into a behavior tree; Step S6-2: Generate 6D pose data and 3D size data based on the pre-edit reference view image, the post-edit reference visual image, and the object pose estimation model; Step S6-3: Based on the task planning in the behavior tree, and combining the 6D pose data and the 3D size data, control the grasping device to grasp the target object and place it at the designated position.
7. The grasping and placing method based on image editing and multimodal thought chain reasoning according to claim 6, characterized in that: wherein In step S6-2, the reference view image before editing and the reference visual image after editing are input into the target detection model to generate features containing the detection box and label of the target object. The features are input into the object pose estimation model to generate the 6D pose data and the 3D size data.
8. The grasping and placing method based on image editing and multimodal thought chain reasoning according to claim 6, characterized in that: wherein, In step S6-3, after the target object is located at the designated position, the change in scene state is synchronously updated to the behavior tree.
9. A grasping and placing device based on image editing and multimodal thought chain reasoning, used to grasp a target object and place it at a designated location, characterized in that, include: A grasping module is used to grasp the target object and move the target object; The 3D model building module is used to obtain the RGB sequence of the target object with camera pose, and to build a 3D Gaussian model of the scene based on the RGB sequence. The pre-editing reference viewpoint image generation module is used to select multiple multi-view images from the RGB sequence as pre-editing reference viewpoint images; The 3D model update module is used to edit the target object in each of the pre-edit reference view images, set the target object in the specified position, obtain the corresponding edited image, and update the 3D Gaussian model according to all the edited images; The post-edit reference visual image generation module is used to obtain post-edit reference visual images corresponding to each of the pre-edit reference viewpoint images based on the 3D Gaussian model. The operation step generation module includes a multimodal model, which is used to input the pre-edit reference view image and the post-edit reference visual image into the multimodal model and generate multiple operation steps through multimodal thinking chain reasoning. The grasping control module is used to control the grasping module to grasp the target object and place it at the specified position based on all the operation steps, the pre-editing reference view image and the post-editing reference visual image.