Information processing method, information processing device, and program

The method generates real and virtual contour images using graph neural networks and vision transformers to accurately estimate displacement and pose of deformable objects, addressing the challenge of precise pose estimation during deformation.

WO2026126730A1 · PCT designated stage · Publication Date: 2026-06-18 · HONDA MOTOR CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HONDA MOTOR CO LTD
Filing Date
2025-11-13
Publication Date
2026-06-18


Abstract

An information processing method according to an embodiment of the present invention comprises: acquiring an image of a target object to which a force has been applied from a robot during operation; generating, from the image, an actual contour image representing a contour of the target object; generating, on the basis of at least the force applied to the target object and a pre-deformation 3D shape which is a three-dimensional shape of the target object before the force is applied, a virtual contour image representing the contour of the target object after the force has been applied; and estimating, on the basis of the actual contour image, the virtual contour image, and the force applied to the target object, a displacement amount between the pre-deformation 3D shape and a post-deformation 3D shape which is a three-dimensional shape of the target object after the force has been applied.

Description

Information Processing Method, Information Processing Apparatus, and Program 【0001】 The present invention relates to an information processing method, an information processing apparatus, and a program. This application claims priority based on Japanese Patent Application No. 2024-216291 filed on December 11, 2024, and incorporates its content herein by reference. 【0002】 Techniques are known for constructing, from a plurality of images, a data structure that stores the shape of an object, and for obtaining the pose of the object by tracking it through the plurality of images based on that data structure (see, for example, Patent Document 1). 【0003】 Patent Document 1: U.S. Patent Application Publication No. 2024/0169563 【0004】 For example, when a flexible object such as an Ethernet cable is grasped by a human or a robot, the object may stretch, contract, bend, twist, dent, or collapse depending on how it is grasped. That is, the object may deform. When the object deforms in this way, it becomes difficult to accurately estimate the pose of the object even if the above-described pose estimation technique is used. 【0005】 The present invention has been made in consideration of such circumstances, and an object of the present invention is to provide an information processing method, an information processing apparatus, and a program that can accurately estimate the amount of displacement even when the target object is deformed by manipulation and, as a result, can accurately estimate the pose of the target object. 【0006】 The information processing method, information processing apparatus, and program according to the present invention employ the following configurations.
(1) A first example of the present invention is an information processing method using a computer, the method including: acquiring an image of a target object to which a force is applied by a robot when the target object is manipulated by the robot; generating, from the image, a real contour image representing the contour of the target object; generating a virtual contour image representing the contour of the target object after the force is applied by the robot, based at least on the force applied by the robot to the target object and a pre-deformation 3D shape, which is the three-dimensional shape of the target object before the force is applied by the robot; and estimating the amount of displacement between the pre-deformation 3D shape and a post-deformation 3D shape, which is the three-dimensional shape of the target object after the force is applied, based on the real contour image, the virtual contour image, and the force applied by the robot to the target object. 【0007】 (2) In a second example of the present invention, in the first example, the robot is provided with an end effector capable of gripping the target object, and the force includes a gripping force applied from the end effector to the target object when the end effector grips the target object. 【0008】 (3) In a third example of the present invention, in the second example, the post-deformation 3D shape of the target object is generated based on the orientation of the end effector and the three-dimensional shape of the end effector, in addition to the force applied to the target object by the robot and the pre-deformation 3D shape, and the virtual contour image is generated based on the generated post-deformation 3D shape and the orientation of the camera that captured the target object when the force was applied to the target object by the robot. 
【0009】 (4) In a fourth example of the present invention, in the third example, a first graph neural network combining a first encoder and a first decoder is used to generate the virtual contour image, the first encoder outputs a first latent variable in response to inputs of the force, the pre-deformation 3D shape, the orientation of the end effector, and the 3D shape of the end effector, and the first decoder outputs the post-deformation 3D shape in response to input of the first latent variable. 【0010】 (5) In a fifth example of the present invention, in the fourth example, the first decoder is trained as a second graph neural network in combination with a second encoder for training, which is different from the first encoder, and a training dataset is used to train the second graph neural network. The training dataset includes a dataset in which output data is labeled for input data. The input data includes initial values for a mesh representing the three-dimensional shape of a reference object for training, initial values for parameters relating to the elasticity or plasticity of the reference object, a texture map of the reference object, and the pose of a camera that captured the reference object. The output data includes an image of the reference object, a contour image representing the contour of the reference object, and the mesh of the reference object. 
【0011】 (6) A sixth example of the present invention further includes, in the fifth example: outputting, by the second encoder, a second latent variable in response to input of the input data; outputting, by the first decoder, the mesh of the reference object in response to input of the second latent variable; rendering the mesh output by the first decoder to generate an image of the reference object and a contour image of the reference object; and adjusting the parameters of the second graph neural network such that the differences between the generated image, the generated contour image, and the output mesh on the one hand and the ground truth image, contour image, and mesh included in the output data on the other are reduced. 【0012】 (7) A seventh example of the present invention further includes, in the second example, controlling the end effector such that the amount of displacement is reduced. 【0013】 (8) An eighth example of the present invention is an information processing device comprising: an acquisition unit that acquires an image of a target object to which a force is applied by a robot when the target object is manipulated by the robot; a first generation unit that generates, from the image, a real contour image representing the contour of the target object; a second generation unit that generates a virtual contour image representing the contour of the target object after the force is applied by the robot, based at least on the force applied by the robot to the target object and a pre-deformation 3D shape, which is the three-dimensional shape of the target object before the force is applied by the robot; and an estimation unit that estimates the amount of displacement between the pre-deformation 3D shape and a post-deformation 3D shape, which is the three-dimensional shape of the target object after the force is applied, based on the real contour image, the virtual contour image, and the force applied by the robot to the target object. 
【0014】 (9) A ninth example of the present invention is a program that causes a computer to execute: acquiring an image of a target object to which a force is applied by a robot when the target object is manipulated by the robot; generating, from the image, a real contour image representing the contour of the target object; generating a virtual contour image representing the contour of the target object after the force is applied by the robot, based at least on the force applied by the robot to the target object and a pre-deformation 3D shape, which is the three-dimensional shape of the target object before the force is applied by the robot; and estimating the amount of displacement between the pre-deformation 3D shape and a post-deformation 3D shape, which is the three-dimensional shape of the target object after the force is applied, based on the real contour image, the virtual contour image, and the force applied by the robot to the target object. 【0015】 According to the above examples, even if the target object is deformed by manipulation, the amount of displacement can be estimated with high accuracy and, as a result, the pose of the target object can also be estimated with high accuracy. 【0016】 Figure 1 is a diagram schematically showing the appearance of the robot included in the operation control system in the embodiment. Figure 2 is a configuration diagram of the operation control system in the embodiment. Figure 3 is a flowchart showing a series of processing steps of the second processing unit in the embodiment. Figure 4 is a diagram showing an example of the estimation model in the embodiment. Figure 5 is a diagram for explaining the displacement vector. Figure 6 is a diagram for explaining how an object is manipulated according to the amount of displacement. Figure 7 is a diagram for explaining the training of the physics graph network. 【0017】 Hereinafter, embodiments of the information processing method, information processing apparatus, and program of the present invention will be described with reference to the drawings. 
【0018】 [Appearance of the robot] Figure 1 is a schematic diagram showing the appearance of the robot 10 included in the operation control system 1 in the embodiment. The robot 10 is typically a humanoid robot that can grasp or manipulate an object OB with an end effector 11, but is not limited to this, and may be any type of robot that can grasp or manipulate an object OB. For example, the robot 10 may be a quadrupedal animal-type robot, an industrial robot, a military robot, or any other type of robot. 【0019】 Object OB can be a soft object such as an Ethernet cable, a plastic bottle, a paper carton, a tube (such as a toothpaste tube or a facial cleanser tube), rubber, a spring, cardboard, or food. When object OB is manipulated by the end effector 11, force is applied to object OB from the end effector 11. This causes deformation in object OB, such as stretching, contracting, bending, twisting, denting, or crushing. Object OB is an example of a "target object". 【0020】 The end effector 11 is also called a robot hand or manipulator. The end effector 11 may be equipped with several fingers (such as a thumb, index finger, middle finger, and ring finger) as grippers. 【0021】 The end effector 11 is equipped with multiple tactile sensors 13, multiple force sensors 14, multiple posture sensors 15, and the like. 【0022】 The tactile sensors 13 are distributed over the palm of the end effector 11, for example. Specifically, a total of 224 tactile sensors 13 may be arranged on the palm of the end effector 11. In other words, the tactile sensors 13 may detect the force applied to the palm with a total of 224 channels. Each channel of the tactile sensors 13 is called a tactile pixel, also known as a taxel. 【0023】 The force sensors 14 are, for example, placed at the fingertips of the end effector 11 and detect the force (load) applied to each fingertip along three axes (X, Y, Z) and the moment (torque) around each axis. 
For example, if the end effector 11 is equipped with a thumb, index finger, middle finger, and ring finger, one force sensor 14 is placed on each finger, and force and moment are detected in a total of 4 × 6 = 24 channels. 【0024】 The posture sensors 15 are, for example, placed on each finger of the end effector 11 and detect the posture of each finger. The posture detected by the posture sensors 15 is typically the joint angle of each finger, but is not limited to this; it may also be the angular velocity or torque of the joint angle, or a combination thereof. In the following explanation, we will use the example that the posture detected by the posture sensors 15 is the joint angle. 【0025】 For example, if the end effector 11 is equipped with a thumb, index finger, middle finger, and ring finger, and each finger is further equipped with four joints, the posture sensor 15 will detect joint angles in a total of 4 x 4 = 16 channels. 【0026】 The number of tactile sensors 13 is not limited to 224; for example, it can be any number ranging from several tens to several hundred. Similarly, the number of force sensors 14 and posture sensors 15 can also be any number. 【0027】The robot 10 may, for example, be equipped with, in addition to the end effector 11, a vision sensor 12 for imaging the external environment or workspace as seen from the robot 10, and a control device 100 for controlling the robot's movements. The robot 10 performs the target task according to the actions determined by the control device 100. 【0028】 A task is, for example, to grab an object OB with one end effector 11, transfer the object OB to the other end effector 11, or move the object OB. However, tasks are not limited to these, and any task can be set. 【0029】 The visual sensor 12 is installed on a part of the robot 10's body (typically the head). The visual sensor 12 may be, for example, a 2D camera or a 3D camera (depth camera). 
For example, the visual sensor 12 captures a scene in which an object OB is grasped or manipulated by the end effector 11, and transmits the RGB image of that scene (hereinafter referred to as the "real image") to the control device 100, or to an external device (e.g., a human-machine interface) via the control device 100. The real image may typically be a two-dimensional image. Note that the visual sensor 12 is not limited to a 2D camera or a 3D camera, but may also be a sensor that images the external environment by irradiating it with electromagnetic waves, such as radar or lidar. The visual sensor 12 is an example of a "camera". 【0030】 Furthermore, the above-mentioned real images may be real images generated by a visual sensor 12 installed on the robot 10, or alternatively, real images generated by an external camera (not shown) installed in the robot 10's workspace. The external camera installed in the robot 10's workspace may be used, for example, to perform image analysis such as pattern matching. The image analysis may be an image analysis that extracts the contour of an object OB from the real image. The external camera is another example of a "camera". 【0031】The control device 100 controls the robot 10 to perform a target task by using data indicating the detection results of various sensors (tactile sensor 13, force sensor 14, and posture sensor 15) and the visual sensor 12, which are provided on the end effector 11. 【0032】 The control device 100 may typically be mounted on the robot 10. Alternatively, instead of being mounted on the robot 10, the control device 100 may be installed at a location far from the robot 10 and remotely control the robot 10 via a network NW. The network NW includes, for example, a LAN (Local Area Network) or a WAN (Wide Area Network). 【0033】 [Configuration of the robot and control device] Figure 2 is a configuration diagram of the operation control system 1 in an embodiment. 
The operation control system 1 comprises, for example, a robot 10 and a control device 100. In addition to the end effector 11, vision sensor 12, tactile sensor 13, force sensor 14, and posture sensor 15 described above, the robot 10 further comprises an actuator 16 and a first processing unit 17. 【0034】 The actuator 16 drives various parts of the robot 10 (arms, fingers, legs, head, torso, waist, etc.) under the control of the first processing unit 17. The actuator 16 includes, for example, an electromagnetic motor, gears, artificial muscles, etc. 【0035】 The first processing unit 17 controls the actuator 16 based on control commands generated by the control device 100. The first processing unit 17 is implemented, for example, by a CPU (Central Processing Unit) or GPU (Graphics Processing Unit) executing a program. The first processing unit 17 may also be implemented by hardware such as an LSI (Large Scale Integration), ASIC (Application Specific Integrated Circuit), FPGA (Field-Programmable Gate Array), or SOC (System On Chip), or by the cooperation of software and hardware. 【0036】 The control device 100 includes, for example, a communication interface 110, a second processing unit 120, and a storage unit 130. 【0037】 The communication interface 110 communicates with the robot 10 via a communication line such as a bus, or with external devices via a network NW. The communication interface 110 includes, for example, a wireless communication module including a receiver and a transmitter, or a NIC (Network Interface Card). 【0038】 The second processing unit 120 includes, for example, an acquisition unit 121, a first generation unit 122, a second generation unit 123, an estimation unit 124, a remote control unit 125, and a machine learning unit 126. 【0039】 The components of the second processing unit 120 are realized, for example, by a CPU or GPU executing a program stored in the storage unit 130. 
Some or all of these components may be realized by hardware such as an LSI, ASIC, FPGA, or SOC, or by the cooperation of software and hardware. 【0040】The storage unit 130 is implemented by, for example, an HDD (Hard Disc Drive), a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), a ROM (Read Only Memory), a RAM (Random Access Memory), or the like. The storage unit 130 stores model construction data in addition to various programs (also referred to as instructions) such as firmware and application programs. 【0041】 The model construction data is data (program or algorithm) that defines several machine learning models for estimating the pose (posture) of the object OB. For example, the estimation model MDL described later is defined in the model construction data. The model construction data may be installed in the storage unit 130 from an external device via the network NW, or may be installed in the storage unit 130 from a portable storage medium connected to the drive device of the control device 100. 【0042】 [Processing Flow] Hereinafter, the processing content of the second processing unit 120 will be described using a flowchart. FIG. 3 is a flowchart showing the flow of a series of processes of the second processing unit 120 in the embodiment. The processes of this flowchart may be repeatedly executed until the target task is achieved. 【0043】 First, when the object OB is operated by the end effector 11, the acquisition unit 121 acquires a real image of the object OB to which a force is applied from the end effector 11 (step S100). 【0044】 For example, when the object OB is gripped by the end effector 11, a force (so-called gripping force) is applied to the object OB from each finger of the end effector 11. The acquisition unit 121 acquires a real image of the object OB to which such a gripping force is applied. 
【0045】 As described above, when the object OB is manipulated by the end effector 11, the visual sensor 12 captures an image of the scene. Therefore, the acquisition unit 121 may acquire a real image of the object OB from the visual sensor 12 of the robot 10 via the communication interface 110. 【0046】 The scene in which the object OB is manipulated by the end effector 11 may also be imaged by an external camera instead of the visual sensor 12. In such a case, the acquisition unit 121 may acquire a real image of the object OB from the external camera via the communication interface 110. 【0047】 Next, the first generation unit 122 generates a two-dimensional image (hereinafter referred to as a 2D real contour image) representing the contour of the object OB from the acquired real image (step S102). The 2D real contour image is an example of the "real contour image". 【0048】 Next, the second generation unit 123 generates an image (hereinafter referred to as a 2D virtual contour image) representing the contour of the object OB after the force is applied, based at least on the force applied to the object OB, such as the gripping force, and the mesh (hereinafter referred to as the pre-deformation 3D mesh) showing the three-dimensional shape of the object OB before the force is applied (step S104). The 2D virtual contour image is an example of the "virtual contour image". The pre-deformation 3D mesh is an example of the "pre-deformation 3D shape". 【0049】 Next, the estimation unit 124 estimates the displacement amount of the object OB based on the 2D real contour image, the 2D virtual contour image, and the force applied to the object OB (step S106). 【0050】 The displacement amount of the object OB is an index representing the displacement between the pre-deformation 3D mesh of the object OB and the mesh (hereinafter referred to as the post-deformation 3D mesh) showing the three-dimensional shape of the object OB after the force is applied. 
For example, the displacement amount of the object OB may be represented by a vector. Hereinafter, the vector representing the displacement amount of the object OB is referred to as a displacement vector. The post-deformation 3D mesh is an example of the "post-deformation 3D shape". 【0051】 Next, the remote control unit 125 generates a command to control each actuator 16 of the robot 10 based on the displacement amount of the object OB, and transmits this command to the robot 10 via the communication interface 110 (step S108). 【0052】 Specifically, the remote control unit 125 may generate a command that causes the end effector 11 to manipulate object OB in such a way that the amount of displacement is reduced. 【0053】 When the first processing unit 17 of the robot 10 receives a command from the control device 100, it controls the actuator 16 based on that command. As a result, the object OB is manipulated by the end effector 11 so that the amount of displacement is reduced. This completes the processing of this flowchart. 【0054】 [Estimation Model Architecture] The 2D real contour image, the 2D virtual contour image, and the displacement amount of object OB described above are derived using the estimation model MDL. The architecture of the estimation model MDL is described below. Figure 4 is a diagram showing an example of the estimation model MDL in the embodiment. 【0055】 The estimation model MDL includes, for example, an encoder EN3 for object detection, a physics graph network GN, and a vision transformer ViT. The physics graph network GN is an example of a "first graph neural network." 【0056】 The first generation unit 122 inputs the acquired real image (an image of object OB captured when force is applied) to the object detection encoder EN3 (third encoder). The object detection encoder EN3 outputs a 2D real contour image in response to the input of the real image. 
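The flow of steps S100 to S108 can be sketched as a simple loop. This is a minimal sketch, not the embodiment's implementation: every callable passed in (`acquire_image`, `real_contour`, `virtual_contour`, `estimate_displacement`, `send_command`, `read_force`) is a hypothetical hook standing in for the corresponding unit, and the stopping rule based on a `tolerance` on the displacement norm is an assumption (the description only says the flow may repeat until the target task is achieved).

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def norm(v: Vector) -> float:
    """Euclidean length of a vector."""
    return sum(x * x for x in v) ** 0.5


def control_loop(
    acquire_image: Callable[[], object],          # S100: acquisition unit (hypothetical hook)
    real_contour: Callable[[object], object],     # S102: first generation unit
    virtual_contour: Callable[[Vector], object],  # S104: second generation unit
    estimate_displacement: Callable[[object, object, Vector], Vector],  # S106
    send_command: Callable[[Vector], None],       # S108: remote control unit
    read_force: Callable[[], Vector],             # force applied to the object
    tolerance: float = 1e-3,                      # assumed stopping threshold
    max_iterations: int = 100,
) -> int:
    """Repeat S100..S108 and return the number of iterations executed."""
    for iteration in range(1, max_iterations + 1):
        image = acquire_image()
        force = read_force()
        real = real_contour(image)
        virtual = virtual_contour(force)
        displacement = estimate_displacement(real, virtual, force)
        if norm(displacement) <= tolerance:
            return iteration  # displacement is small enough; stop re-grasping
        send_command(displacement)  # ask the robot to reduce the displacement
    return max_iterations
```

Passing the units in as callables keeps the sketch independent of how each model is actually implemented.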
【0057】 The physics graph network GN is a physics-based machine learning (physics-ML) model implemented by a graph neural network (GNN) that can represent the vertices (nodes) and edges of a mesh, as well as the mesh as a whole, using feature vectors. For example, the physics graph network GN includes a first encoder EN1 and a first decoder DE1. 【0058】 The second generation unit 123 inputs the following to the first encoder EN1: (i) a force applied to object OB (such as a gripping force), (ii) the pose of the end effector 11, (iii) a 3D mesh showing the three-dimensional shape of the end effector 11, and (iv) the pre-deformation 3D mesh of object OB. In addition to (i)-(iv), the second generation unit 123 may also input an external force to the first encoder EN1. An external force is a force that changes the shape (mesh shape) of object OB, such as gravity. 【0059】 The pose of the end effector 11 may be the joint angle of the end effector 11 detected by the posture sensor 15 when force is applied from the end effector 11 to the object OB (i.e., when the object OB is grasped by the end effector 11). The mesh of the end effector 11 and the pre-deformation 3D mesh of the object OB may be predetermined. 【0060】 The first encoder EN1 outputs a first latent variable in response to the inputs (i) to (iv). 【0061】 The first decoder DE1 receives the first latent variable output from the first encoder EN1. The first decoder DE1 outputs the post-deformation 3D mesh of object OB in response to the input of the first latent variable. 【0062】 The second generation unit 123 uses the camera pose to convert the post-deformation 3D mesh of object OB output from the first decoder DE1 into a 2D mesh, and then generates a 2D virtual contour image from that 2D mesh. 【0063】 The camera pose refers to the orientation of the visual sensor 12 (or external camera) when the scene in which object OB is manipulated by the end effector 11 is captured by the visual sensor 12 (or external camera). 
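The conversion from the post-deformation 3D mesh to a 2D virtual contour image can be illustrated as follows. This is a minimal sketch under stated assumptions: the description does not specify the camera model or the rasterization method, so a pinhole camera (rotation, translation, focal length `focal`, principal point `cx`, `cy`) and a brute-force triangle fill are assumed here, and all function names are hypothetical.

```python
def project(vertices, rotation, translation, focal, cx, cy):
    """Project 3D vertices to pixel coordinates with an assumed pinhole camera."""
    pixels = []
    for v in vertices:
        # Camera-frame coordinates: R @ v + t
        xc, yc, zc = (
            sum(rotation[r][k] * v[k] for k in range(3)) + translation[r]
            for r in range(3)
        )
        pixels.append((focal * xc / zc + cx, focal * yc / zc + cy))
    return pixels


def edge_fn(a, b, p):
    """Signed area test: which side of the line a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def silhouette(pixels, faces, width, height):
    """Fill every projected triangle into a binary mask (the 2D mesh)."""
    mask = [[0] * width for _ in range(height)]
    for i, j, k in faces:
        a, b, c = pixels[i], pixels[j], pixels[k]
        for y in range(height):
            for x in range(width):
                p = (x + 0.5, y + 0.5)  # pixel centre
                w0, w1, w2 = edge_fn(a, b, p), edge_fn(b, c, p), edge_fn(c, a, p)
                inside_ccw = w0 >= 0 and w1 >= 0 and w2 >= 0
                inside_cw = w0 <= 0 and w1 <= 0 and w2 <= 0
                if inside_ccw or inside_cw:
                    mask[y][x] = 1
    return mask


def contour(mask):
    """Keep only mask pixels that touch the background or the image border."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            neighbours = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            if any(
                not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]
                for ny, nx in neighbours
            ):
                out[y][x] = 1
    return out
```

A production renderer would handle occlusion and run on a GPU; the sketch only shows the order of operations (project with the camera pose, form the 2D mesh, extract its outline).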
【0064】 The vision transformer ViT is a transformer designed for computer vision and includes, for example, a second encoder EN2 and a second decoder DE2. 【0065】 The estimation unit 124 inputs the following to the second encoder EN2: (i) the force (such as gripping force) or external force (such as gravity) applied to the object OB, (v) the 2D real contour image, and (vi) the 2D virtual contour image. 【0066】 The second encoder EN2 outputs a second latent variable in response to the inputs (i), (v), and (vi). 【0067】 The second decoder DE2 receives the second latent variable output from the second encoder EN2. In response to the input of the second latent variable, the second decoder DE2 outputs a displacement vector and, in addition, the pose of object OB and the post-deformation 3D mesh. In this way, the displacement vector, the pose of object OB, and the post-deformation 3D mesh are estimated by using the estimation model MDL, which includes the physics graph network GN and the vision transformer ViT. 【0068】 Figure 5 is a diagram illustrating the displacement vector. In the figure, M represents the post-deformation 3D mesh of object OB estimated using the estimation model MDL. V1 to V4 represent displacement vectors. If object OB is made of a soft material, the part gripped by the end effector 11 (the part grasped by the fingers) stretches, contracts, or otherwise deforms, as shown in the figure. The amount of deformation after gripping by the end effector 11, relative to before gripping, is therefore estimated as the displacement vector. As described above, the remote control unit 125 generates commands to control the end effector 11 so that the displacement vector V becomes smaller. As a result, the end effector 11 readjusts its grip on object OB or re-grips it so that the displacement vector V becomes smaller. 【0069】 Figure 6 illustrates how object OB is manipulated according to the amount of displacement. 
In the illustrated example, object OB is an Ethernet cable. For example, if the Ethernet cable is grasped at its middle, the cable will bend, and the robot 10 will not be able to insert the terminal into the port. In such a case, a displacement vector V is estimated, and the Ethernet cable is re-grasped so that the displacement vector V becomes smaller. As a result, the robot 10 will be able to insert the terminal of the Ethernet cable into the port. 【0070】 [Training the Estimation Model] The training of the physics graph network GN included in the estimation model MDL will be explained below. Figure 7 is a diagram illustrating the training of the physics graph network GN. The first decoder DE1 of the physics graph network GN is combined with a fourth encoder EN4, which is prepared for training, and trained as a second physics graph network GN-2. The second physics graph network GN-2 is an example of a "second graph neural network," and the fourth encoder EN4 is an example of a "second encoder." 【0071】 The second physics graph network GN-2, which is composed of the first decoder DE1 shared with the physics graph network GN and the fourth encoder EN4, is trained using a training dataset that includes a dataset in which output data is labeled for input data. 【0072】 The input data includes initial values for the mesh representing the 3D shape of a reference object for training, initial values for the deformation parameters of the reference object, a texture map of the reference object generated from a real image of the reference object, and the pose of the camera that captured the reference object. The reference object, like object OB, may be a soft object. 【0073】 Deformation parameters are parameters relating to the physical properties of the reference object, such as elasticity and plasticity. Examples may include elastic modulus, plasticity rate, density, coefficient of restitution, yield strength, and fracture toughness. 
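One way to hold the input data of a training sample is sketched below. This is an illustrative data structure only: the class and field names, the choice of which deformation parameters to store, and the 4x4 matrix representation of the camera pose are all assumptions, not details given in the description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DeformationParams:
    """A subset of the physical properties named in the text (units assumed SI)."""
    elastic_modulus: float
    density: float
    yield_strength: float


@dataclass
class TrainingInput:
    """Input data for the training encoder EN4 (hypothetical layout)."""
    vertices: List[Tuple[float, float, float]]   # initial mesh vertex positions
    faces: List[Tuple[int, int, int]]            # triangles as vertex indices
    deformation: DeformationParams               # initial deformation parameters
    texture_map: List[List[Tuple[int, int, int]]]  # RGB texels of the reference object
    camera_pose: List[List[float]]               # assumed 4x4 extrinsic matrix

    def validate(self) -> bool:
        """Check that faces index existing vertices and the pose is 4x4."""
        n = len(self.vertices)
        faces_ok = all(0 <= i < n for face in self.faces for i in face)
        pose_ok = len(self.camera_pose) == 4 and all(
            len(row) == 4 for row in self.camera_pose
        )
        return faces_ok and pose_ok
```

A check such as `validate` is cheap to run on every sample before it is fed to the encoder.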
【0074】 The output data includes the ground truth 2D image, 2D contour image, and 3D mesh of the reference object. 【0075】 First, the machine learning unit 126 of the control device 100 inputs the input data included in the training dataset to the fourth encoder EN4 of the second physics graph net GN-2. 【0076】 The fourth encoder EN4 outputs a second latent variable in response to the input data (initial mesh values, initial deformation parameter values, texture map, and camera pose). 【0077】 The first decoder DE1 of the second physics graph network GN-2 receives the second latent variable output from the fourth encoder EN4. In response to the input of the second latent variable, the first decoder DE1 outputs a 3D mesh of the reconstructed reference object. 【0078】The machine learning unit 126 generates a 2D image of the reference object and a 2D contour image of the reference object by rendering the 3D mesh of the reference object output from the first decoder DE1. The 2D image of the reference object includes the shading and texture of the object's bumps and depressions in two dimensions. 【0079】 Then, the machine learning unit 126 adjusts the parameters (such as weight coefficients and bias components) of the second physics graph network GN-2 based on the image of the reference object (2D RGB image) and the 2D contour image of the reference object generated by rendering, the 3D mesh of the reference object reconstructed by the first decoder DE1, and the 2D image, 2D contour image, and 3D mesh of the ground truth (target) reference object included as output data in the training dataset. 【0080】 Specifically, the machine learning unit 126 calculates the loss according to formula (1) and adjusts the parameters of the second physics graph network GN-2 so that the loss is minimized. 【0081】 【0082】 As shown in equation (1), the loss includes photometric loss, silhouette loss, vertex position loss, edge length loss, face area loss, and normal vector loss. 
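Formula (1) itself is not reproduced in this text. Given the six loss terms and the six weighting coefficients named in the surrounding paragraphs, a weighted sum of the following form would be consistent with the description. The symbol names for the individual terms, and the pairing of each coefficient with a term (taken in the order in which they are listed), are assumptions:

```latex
\mathcal{L}
  = \alpha\,\mathcal{L}_{\mathrm{photometric}}
  + \beta\,\mathcal{L}_{\mathrm{silhouette}}
  + \gamma\,\mathcal{L}_{\mathrm{vertex}}
  + \sigma\,\mathcal{L}_{\mathrm{edge}}
  + \epsilon\,\mathcal{L}_{\mathrm{area}}
  + \zeta\,\mathcal{L}_{\mathrm{normal}}
```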
【0083】 Photometric loss is a loss term that encourages the reconstructed 3D mesh to match the RGB data of the input image. Silhouette loss is a loss term that encourages the reconstructed 2D surface to match the foreground segmentation mask of the input image frame. Vertex position loss is a loss term that encourages the vertex positions of the 3D mesh to match the vertex positions of the target mesh. Edge length loss is a loss term that encourages the edge lengths of the 3D mesh to match the edge lengths of the target mesh. Face area loss is a loss term that encourages the area of each face of the 3D mesh to match the area of the corresponding face of the target mesh. Normal vector loss is a loss term that encourages the normal vector of each vertex of the 3D mesh to match the corresponding normal vector of the target mesh. α, β, γ, σ, ε, and ζ are arbitrary weighting coefficients for the respective loss terms. 【0084】 Once the parameters of the second physics graph network GN-2 have been adjusted based on the above loss, the first decoder DE1 included in the second physics graph network GN-2, i.e., the trained first decoder DE1, is combined with the first encoder EN1 and used as the physics graph network GN. 【0085】 According to the embodiment described above, the second processing unit 120 of the control device 100 acquires a real image of the object OB to which force is applied from the end effector 11 when the object OB is manipulated by the end effector 11, and uses the encoder EN3 for object detection to generate a 2D real contour image representing the contour of the object OB from the real image. 
【0086】 Furthermore, the second processing unit 120 uses the physics graph network GN to generate a deformed 3D mesh representing the 3D shape of the object OB after the force is applied, from (i) the force applied to the object OB (such as a gripping force), (ii) the pose of the end effector 11, (iii) a 3D mesh representing the 3D shape of the end effector 11, and (iv) a pre-deformation 3D mesh representing the 3D shape of the object OB before the force is applied. 【0087】 Furthermore, the second processing unit 120 uses the camera pose to convert the deformed 3D mesh of the object OB into a 2D mesh, and then generates a 2D virtual contour image from that 2D mesh. 【0088】 Furthermore, the second processing unit 120 uses the vision transformer ViT to estimate, from (i) the force applied to the object OB (such as a gripping force), (v) the 2D real contour image, and (vi) the 2D virtual contour image, a displacement vector and, in addition, the pose of the object OB and the deformed 3D mesh. With this configuration, even if the object OB is deformed by the operation of the end effector 11, the amount of displacement can be estimated with high accuracy, and as a result, the pose of the object OB can also be estimated with high accuracy. 【0089】 In the embodiment described above, the training data for the second physics graph network GN-2 was described as including, as input data, the initial values of the reference object's mesh, the initial values of the reference object's deformation parameters, the reference object's texture map, and the camera pose used to image the reference object. The input data may also include external forces such as gravity. By using such training data, the first decoder DE1, which is shared with the physics graph network GN, can learn the causal relationship between external forces (e.g., gravity) and the deformation of the object OB. 
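The conversion described in paragraph 【0087】, from the deformed 3D mesh to a 2D mesh and then to a 2D virtual contour image, can be sketched as follows. The text only says that the camera pose is used; the pinhole model, the intrinsic matrix `K`, the world-to-camera rotation `R` and translation `t`, and all function names are assumptions introduced for illustration.

```python
import numpy as np

def project_vertices(verts, R, t, K):
    """Pinhole projection of world-space vertices (V, 3) to pixel coordinates (V, 2)."""
    cam = verts @ R.T + t            # world frame -> camera frame
    uvw = cam @ K.T                  # camera frame -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]

def rasterize_silhouette(uv, faces, height, width):
    """Fill every projected triangle into a boolean mask (a silhouette needs no depth test)."""
    mask = np.zeros((height, width), dtype=bool)
    for a, b, c in uv[faces]:
        area2 = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        if abs(area2) < 1e-12:
            continue                 # skip degenerate triangles
        x0, y0 = np.floor(np.minimum(np.minimum(a, b), c)).astype(int)
        x1, y1 = np.ceil(np.maximum(np.maximum(a, b), c)).astype(int)
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, width - 1), min(y1, height - 1)
        if x0 > x1 or y0 > y1:
            continue                 # triangle lies outside the image
        xs, ys = np.meshgrid(np.arange(x0, x1 + 1), np.arange(y0, y1 + 1))

        def edge(p, q):
            # Signed edge function of pixel (xs, ys) relative to the edge p -> q.
            return (q[0] - p[0]) * (ys - p[1]) - (q[1] - p[1]) * (xs - p[0])

        e0, e1, e2 = edge(a, b), edge(b, c), edge(c, a)
        inside = ((e0 >= 0) & (e1 >= 0) & (e2 >= 0)) | ((e0 <= 0) & (e1 <= 0) & (e2 <= 0))
        mask[y0:y1 + 1, x0:x1 + 1] |= inside
    return mask

def contour_from_mask(mask):
    """Foreground pixels that have at least one 4-connected background neighbour."""
    p = np.pad(mask, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return mask & ~interior
```

A virtual contour image would then be obtained as `contour_from_mask(rasterize_silhouette(project_vertices(verts, R, t, K), faces, H, W))`. The same `contour_from_mask` step could in principle be applied to a segmentation mask of the real image, so that both contour images share one pixel convention before being compared; that pairing is a design suggestion, not something the text specifies.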
【0090】 Although embodiments for carrying out the present invention have been described above using examples, the present invention is not limited in any way to these embodiments, and various modifications and substitutions can be made without departing from the spirit of the present invention. 【0091】1...Operation control system, 10...Robot, 11...End effector, 12...Vision sensor, 13...Tactile sensor, 14...Force sensor, 15...Posture sensor, 16...Actuator, 17...First processing unit, 100...Control device, 110...Communication interface, 120...Second processing unit, 121...Acquisition unit, 122...First generation unit, 123...Second generation unit, 124...Estimation unit, 125...Remote control unit, 126...Machine learning unit, 130...Storage unit, NW...Network, MDL...Estimation model, GN...Physics graph network, ViT...Vision transformer

Claims

1. A computer-based information processing method comprising: acquiring an image of a target object to which a force is applied by a robot while the robot is being operated; generating, from the image, a real contour image representing the contour of the target object; generating a virtual contour image representing the contour of the target object after the force is applied by the robot, based at least on the force applied by the robot to the target object and a pre-deformation 3D shape, which is the three-dimensional shape of the target object before the force is applied by the robot; and estimating the amount of displacement between the pre-deformation 3D shape and a post-deformation 3D shape, which is the three-dimensional shape of the target object after the force is applied, based on the real contour image, the virtual contour image, and the force applied by the robot to the target object.

2. The information processing method according to claim 1, wherein the robot is provided with an end effector capable of gripping the target object, and the force includes a gripping force applied from the end effector to the target object when the end effector grips the target object.

3. The information processing method according to claim 2, further comprising: generating the post-deformation 3D shape of the target object based on the pose of the end effector and the three-dimensional shape of the end effector, in addition to the force applied to the target object by the robot and the pre-deformation 3D shape; and generating the virtual contour image based on the generated post-deformation 3D shape and the pose of the camera that captured the target object when the force was applied to the target object by the robot.

4. The information processing method according to claim 3, wherein a first graph neural network combining a first encoder and a first decoder is used to generate the virtual contour image, the first encoder outputs a first latent variable in response to input of the force, the pre-deformation 3D shape, the pose of the end effector, and the three-dimensional shape of the end effector, and the first decoder outputs the post-deformation 3D shape in response to input of the first latent variable.

5. The information processing method according to claim 4, wherein the first decoder is trained as a second graph neural network in combination with a second encoder for training that is different from the first encoder, a training dataset is used to train the second graph neural network, and the training dataset includes a dataset in which output data including an image of a reference object for training, a contour image representing the contour of the reference object, and a mesh of the reference object is associated, as a label, with input data including initial values for a mesh representing the three-dimensional shape of the reference object, initial values for parameters relating to the elasticity or plasticity of the reference object, a texture map of the reference object, and the pose of a camera that captured an image of the reference object.

6. The information processing method according to claim 5, further comprising: the second encoder outputting a second latent variable in response to the input data being input; the first decoder outputting the mesh of the reference object in response to the input of the second latent variable; generating an image of the reference object, a contour image of the reference object, and the mesh of the reference object by rendering the mesh output by the first decoder; and adjusting the parameters of the second graph neural network such that the difference between the generated image, contour image, and mesh and the ground truth image, contour image, and mesh included in the output data is reduced.

7. The information processing method according to claim 2, further comprising controlling the end effector so that the amount of displacement becomes small.

8. An information processing device comprising: an acquisition unit that acquires an image of a target object to which a force is applied by a robot while the robot is being operated; a first generation unit that generates, from the image, a real contour image representing the contour of the target object; a second generation unit that generates a virtual contour image representing the contour of the target object after the force is applied by the robot, based at least on the force applied by the robot to the target object and a pre-deformation 3D shape, which is the three-dimensional shape of the target object before the force is applied by the robot; and an estimation unit that estimates the amount of displacement between the pre-deformation 3D shape and a post-deformation 3D shape, which is the three-dimensional shape of the target object after the force is applied, based on the real contour image, the virtual contour image, and the force applied by the robot to the target object.

9. A program that causes a computer to execute: acquiring an image of a target object to which a force is applied by a robot while the robot is being operated; generating, from the image, a real contour image representing the contour of the target object; generating a virtual contour image representing the contour of the target object after the force is applied by the robot, based at least on the force applied by the robot to the target object and a pre-deformation 3D shape, which is the three-dimensional shape of the target object before the force is applied by the robot; and estimating the amount of displacement between the pre-deformation 3D shape and a post-deformation 3D shape, which is the three-dimensional shape of the target object after the force is applied, based on the real contour image, the virtual contour image, and the force applied by the robot to the target object.