Control device, control method, and control program
The control system integrates model predictive control and reinforcement learning to address changes in powder properties during grinding, ensuring consistent robot performance by adapting to both short-term and long-term changes in the powder state.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- OMRON CORP
- Filing Date
- 2025-12-04
- Publication Date
- 2026-06-11
AI Technical Summary
Existing robotic powder grinding systems fail to adapt to changes in the properties of powders due to grinding, such as aggregation, spreading, or changes in particle shape, leading to suboptimal task performance.
A control system combining model predictive control and model-free reinforcement learning to adjust robot operations in response to short-term and long-term changes in powder state, using trained models to minimize the difference between predicted and target states.
Enables the robot to effectively adapt its actions to changing powder conditions, ensuring consistent and desired grinding results by minimizing the difference between predicted and target states.
Abstract
Description
Control device, control method, and control program
[0001] This disclosure relates to a control device, a control method, and a control program.
[0002] Conventionally, robotic powder grinding systems that apply mechanical force to powders are known (see, for example, Document 1 (Nakajima, Yusaku, et al. "Force-Controlled Robotic Mechanochemical Synthesis," Digital Discovery, 2024)). In such a powder grinding system, the robot grinds the powder by applying force to it.
[0003] However, some tasks do not go well when a robot simply keeps performing predetermined actions. For example, when the properties of an object involved in a task change as a result of the robot's actions, the task is not expected to go well even if the robot is controlled to perform predetermined actions.
[0004] For example, when a robot grinds powder, the force that the robot's end effector receives from the powder and the friction between the end effector and the powder change as the powder is ground. As disclosed in Document 1 above, when a robot grinds powder in a mortar using a pestle that it holds, the powder may aggregate locally or spread out within the mortar as grinding progresses, or the particle shape of the powder may change, causing the friction between the mortar and the powder to change.
[0005] The powder grinding system disclosed in Document 1 above only grinds the powder in a mortar by controlling a robot to perform preset actions, and does not take into account changes in the powder in the mortar.
[0006] This disclosure is made in view of the above points, and aims to perform robot control in response to changes in the state of an object when the state of the object changes due to the robot's operation.
[0007] To achieve the above objective, the control device according to this disclosure is a control device for controlling the operation of a robot, and includes: an acquisition unit that acquires the current state data of the robot; a parameter acquisition unit that acquires the control parameters corresponding to the current state data by inputting the current state data to a first trained model that has been trained in advance by reinforcement learning to output control parameters for time t when state data for time t is input; a setting unit that sets the control parameters in a second trained model that has been trained in advance to output predicted values of state data for time t+1 when state data for time t and action candidate data for time t are input; an action acquisition unit that acquires predicted values of state data for the next time by inputting the current state data and action candidate data to the second trained model in which the control parameters have been set, and acquires, as action data, the action candidate data for which the difference between the predicted value and the target value of the state data for the next time becomes small; and a control unit that controls the operation of the robot so that the operation represented by the action data is realized.
[0008] Furthermore, the control method according to this disclosure is a control method for controlling the operation of a robot, in which a computer executes the following processes: acquiring the current state data of the robot; acquiring the control parameters corresponding to the current state data by inputting the current state data to a first trained model that has been trained in advance by reinforcement learning to output control parameters for time t when state data for time t is input; setting the control parameters in a second trained model that has been trained in advance to output predicted values of state data for time t+1 when state data for time t and action candidate data for time t are input; acquiring predicted values of state data for the next time by inputting the current state data and action candidate data to the second trained model in which the control parameters have been set; acquiring, as action data, the action candidate data for which the difference between the predicted value and the target value of the state data for the next time becomes small; and controlling the operation of the robot so that the operation represented by the action data is realized.
[0009] Furthermore, the control program according to this disclosure is a control program for controlling the operation of a robot, and causes a computer to execute the following processes: acquiring the current state data of the robot; acquiring the control parameters corresponding to the current state data by inputting the current state data to a first trained model that has been trained in advance by reinforcement learning to output control parameters for time t when state data for time t is input; setting the control parameters in a second trained model that has been trained in advance to output predicted values of state data for time t+1 when state data for time t and action candidate data for time t are input; acquiring predicted values of state data for the next time by inputting the current state data and action candidate data to the second trained model in which the control parameters have been set; acquiring, as action data, the action candidate data for which the difference between the predicted value and the target value of the state data for the next time becomes small; and controlling the operation of the robot so that the operation represented by the action data is realized.
[0010] According to this disclosure, when the state of an object changes due to the robot's movements, it is possible to perform control of the robot in accordance with the change in the state of the object.
[0011] Figure 1 is a diagram illustrating the control system of this embodiment. Figure 2 is a diagram illustrating the changes in the powder. Figure 3 is a diagram illustrating the overview of this embodiment. Figure 4 is a block diagram showing the schematic configuration of the control system of this embodiment. Figure 5 is a block diagram showing the hardware configuration of the control device according to this embodiment. Figure 6 is a flowchart showing the flow of the trained model generation process in this embodiment. Figure 7 is a flowchart showing the flow of the control process in this embodiment. Figure 8 is a diagram showing the results of this embodiment.
[0012] Hereinafter, an example of an embodiment of the present disclosure will be described with reference to the drawings. In this embodiment, a control system equipped with the control device according to the present disclosure will be described as an example. In each drawing, the same or equivalent components and parts are given the same reference numerals. Also, the dimensions and proportions in the drawings are exaggerated for illustrative purposes and may differ from the actual proportions.
[0013] Figure 1 is a diagram illustrating this embodiment. As shown in Figure 1, the control system 10 of this embodiment comprises a robot 12 and a control device 14. A pestle M, which is an example of a means for applying force to an object, is attached to the end effector 16 of the robot 12. The object is placed in the mortar C shown in Figure 1. As shown in Figure 1, the robot 12 performs the task of crushing the object in the mortar C using the pestle M by operating the end effector 16. In this embodiment, the case where the object is a powder will be described as an example.
[0014] Figure 2 is a diagram illustrating the changes in the powder. Figure 2 is a top view of the process of grinding powder P in a mortar C using a pestle M. T1 in Figure 2 shows the state immediately after grinding of powder P by the pestle M begins. T2 in Figure 2 shows the state after a predetermined time has elapsed since grinding began. As shown in Figure 2, at time T1, when grinding begins, the powder P has not spread very far within the mortar C. At time T2, after the predetermined time has elapsed, the powder P has spread widely within the mortar C. Furthermore, as grinding of powder P by the pestle M progresses, the physical properties of powder P change. For example, the viscosity of powder P may increase or decrease as it is ground. Alternatively, when powder P is pulverized, the size of the particles constituting powder P decreases, which may change the coefficient of friction between the pestle M and powder P.
[0015] More specifically, when the robot 12 performs the tasks described above, the dynamics of contact between the robot 12 and the powder P change in the short and long term. In the short term, as mentioned above, an uneven distribution of powder P occurs in the mortar C during the grinding process. On the other hand, in the long term, as mentioned above, the particle size of powder P decreases as the grinding progresses, and friction and other factors change. In such cases, if the robot 12 is controlled uniformly, it is conceivable that the desired results regarding the grinding of powder P may not be obtained.
[0016] Therefore, in this embodiment, when the state of an object changes due to the operation of the robot 12, control of the robot 12 is performed in accordance with the change in the state of the object. Specifically, when the state of an object changes due to the robot 12 applying force to the object, control of the robot 12 is performed in accordance with the change in the state of the object. In this embodiment, control of the robot 12 in accordance with changes in the object is realized by combining model predictive control and model-free reinforcement learning. The state of the object refers to the form or properties of the object as described above.
[0017] Model predictive control is a control method that uses a dynamics model to predict the future and adjusts actions at each time step. Model predictive control can therefore respond to short-term changes. A trained model generated by model-free reinforcement learning, on the other hand, can respond to long-term changes. Specifically, when a trained model is generated by model-free reinforcement learning, the rewards that will be obtained over time when an action is performed according to the state of the robot 12 at a given time are taken into account. Accordingly, in this embodiment, the parameters of the cost function of model predictive control are adjusted using the trained model generated by model-free reinforcement learning. This makes it possible to respond to both short-term and long-term changes in the object when the state of the object changes due to the movement of the robot 12.
[0018] In this embodiment, simulation data is obtained by running a simulation in which the robot 12 performs a task of crushing powder P using a pestle M. Then, based on this simulation data, a machine learning model is trained using model-free reinforcement learning, and a dynamics model that constitutes the control model used in model predictive control is also trained. In this embodiment, the physical properties and behavior of powder P are simulated, as well as the operation of the robot 12. In this embodiment, the case in which the dynamics model is trained based on simulation data obtained by running the simulation is described as an example, but it is also possible to actually operate the robot to obtain data similar to the above simulation and train the dynamics model based on that data.
[0019] <Framework of this Embodiment> (A. System Overview) Figure 3 is a diagram illustrating the overview of the control system 10 of this embodiment. In this embodiment, Soft Actor-Critic (hereinafter simply referred to as the "actor-critic algorithm"), which is an example of a model-free reinforcement learning algorithm, is used. The actor-critic algorithm is disclosed, for example, in Reference 1 below.
[0020] Reference 1: T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning, 2018, pp. 1861-1870.
[0021] Furthermore, in this embodiment, an example of a known compliance controller, FDCC (Forward Dynamics Compliance Controller), is used. As shown in Figure 3, the FDCC outputs data representing the joint positions of the robot 12. The FDCC is disclosed, for example, in Reference 2 below.
[0022] Reference 2: S. Scherzinger, A. Roennau, and R. Dillmann, “Forward dynamics compliance control (FDCC): A new approach to cartesian compliance for robotic manipulators,” in IEEE / RSJ International Conference on Intelligent Robots and Systems, 2017, pp. 4568-4575.
[0023] As shown in Figure 3, the actor, which is a reinforcement learning agent, outputs control parameters [Q, p] as the actor's own action (hereinafter also referred to as "actor action"). The control parameters [Q, p] are parameters used in model predictive control. Model predictive control incorporating the control parameters [Q, p] is then executed, and FDCC parameters are output as the action in model predictive control (hereinafter also referred to as "control action"). The FDCC parameters are set in the FDCC. The FDCC parameters consist of, for example, parameters representing the displacement of the end effector 16 of the robot 12 and parameters representing the stiffness. The FDCC and each of its parameters are disclosed in Reference 2; refer to Reference 2 for details.
[0024] As shown in Figure 3, the FDCC receives target trajectory data of the pestle M grasped by the robot 12 and target force data applied to the pestle M. The FDCC, with its built-in parameters, outputs control commands representing the joint positions of the robot 12 based on the various input data. The robot 12 then executes actions according to the control commands output from the FDCC.
[0025] As shown in Figure 3, the posture of the end effector 16 and the force and torque (or moment) acting on the robot 12 when the robot 12 is operating are detected and observed as state data. The robot is equipped with sensors at n locations to detect force and torque.
[0026] As shown in Figure 3, the state data observed at the current time is input to the FDCC and used when outputting the control command for the next time. The state data observed at the current time is also input to the model predictive control and used when outputting the FDCC parameters, which are the actions for the next time. In this embodiment, the model predictive control outputs the FDCC parameters at a control frequency of 20 Hz, which is lower than the 500 Hz control frequency of the FDCC.
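The relationship between the two rates can be pictured as a nested loop. The following is a minimal sketch, assuming that the 500 Hz FDCC loop simply holds the most recent FDCC parameters between 20 Hz updates; `mpc_update` and `fdcc_step` are hypothetical stand-ins introduced for illustration and are not functions of this disclosure.

```python
FDCC_RATE_HZ = 500  # inner compliance-control loop (from the text)
MPC_RATE_HZ = 20    # outer model predictive control loop (from the text)
TICKS_PER_MPC_UPDATE = FDCC_RATE_HZ // MPC_RATE_HZ  # inner ticks per outer update


def mpc_update(state):
    """Hypothetical stand-in: return FDCC parameters for the given state."""
    return {"displacement": 0.0, "stiffness": 1.0 + abs(state)}


def fdcc_step(state, fdcc_params):
    """Hypothetical stand-in: one inner-loop tick that nudges the state toward 0."""
    return state - 0.01 * fdcc_params["stiffness"] * state


def run(initial_state, n_ticks):
    state = initial_state
    fdcc_params = None
    mpc_calls = 0
    for tick in range(n_ticks):
        # The outer loop refreshes the FDCC parameters only on every 25th tick.
        if tick % TICKS_PER_MPC_UPDATE == 0:
            fdcc_params = mpc_update(state)
            mpc_calls += 1
        # The inner loop runs on every tick with the parameters currently held.
        state = fdcc_step(state, fdcc_params)
    return state, mpc_calls


# Simulate one second of operation.
final_state, mpc_calls = run(initial_state=1.0, n_ticks=FDCC_RATE_HZ)
```

In this sketch the slower loop only changes how the faster loop behaves; it never issues joint commands itself, which mirrors the division of roles in Figure 3.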
[0027] (B. Model Predictive Control) In this embodiment, model predictive control is used to calculate parameters for FDCC. A dynamics model is used in model predictive control. The dynamics model can be implemented using a known machine learning model, for example, a neural network.
[0028] The input data for the dynamics model consists of state data at time t and candidate action data at time t, while the output data of the dynamics model is a predicted value of the state data at time t+1. The dynamics model generates a predicted value of the state data at time t+1 from the state data at time t and candidate action data at time t. By using the dynamics model, it becomes possible to predict what state will occur at time t+1 based on what action is taken at time t. For this purpose, each of the multiple candidate action data at time t is input into the dynamics model, and each of the predicted values of the state data at time t+1 corresponding to each of the multiple candidate action data is obtained. Then, the candidate action data corresponding to the predicted value that is closest to the target value is adopted as the formal action data.
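The candidate selection described above can be sketched as follows. This is an illustrative sketch only: the linear `dynamics_model` is a hypothetical stand-in for the trained neural network, the candidates are assumed to be drawn uniformly at random, and the Euclidean distance to the target is an assumed measure of closeness.

```python
import numpy as np


def dynamics_model(state, action):
    """Hypothetical stand-in for the trained dynamics model: x_{t+1} = f(x_t, u_t)."""
    return 0.9 * state + 0.5 * action


def select_action(state, target, candidates):
    # Predict the state at time t+1 for every candidate action at time t.
    predictions = np.array([dynamics_model(state, u) for u in candidates])
    # Distance between each predicted state and the target value.
    errors = np.linalg.norm(predictions - target, axis=1)
    best = int(np.argmin(errors))
    # The candidate whose prediction is closest to the target becomes the formal action.
    return candidates[best], predictions[best]


rng = np.random.default_rng(0)
state = np.array([1.0, -0.5])
target = np.zeros(2)
candidates = rng.uniform(-2.0, 2.0, size=(256, 2))  # assumed candidate set
action, predicted = select_action(state, target, candidates)
```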
[0029] Therefore, in this embodiment, the dynamics model f is used to identify the control action data u that minimizes the following cost function J_Q(x). (1) (2)
[0030] In the above equations, x is the state data and consists of the posture of the end effector 16 of the robot 12 and the force and torque acting on the robot 12. k is the time step (0 to N). u is the control action data and, as described above, consists of parameters representing the displacement of the end effector 16 of the robot 12 and parameters representing the stiffness. The control action data u corresponds to the FDCC parameters in Figure 3 above. f is the dynamics model: when the state data x_k at time k and the control action data u_k at time k are input, it outputs the predicted value x_{k+1} of the state data at time k+1. Q and p are the weight parameters of the cost function J_Q(x); Q is a diagonal matrix and p is a vector. Q and p correspond to the control parameters in Figure 3 above.
[0031] The policy π(x) in equation (2) above is a function that outputs the control action data u that minimizes the cost function J_Q(x). The dynamics model f is used when identifying the control action data u that minimizes the cost function J_Q(x) of equation (1) above. Note that the state data x in the cost function J_Q(x) is expressed with the target value taken as 0; that is, the state data x in equation (1) represents the difference between the state data and the target value. Therefore, the cost function J_Q(x) is a function for making the difference between the actual state data and the target value as small as possible while also keeping the control action data u as small as possible.
[0032] When training the dynamics model f, supervised learning with known labels or self-supervised learning is used. Specifically, the dynamics model f is trained based on the state data and control action data included in the simulation data as described above.
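As an illustration of how input and output pairs lead to a dynamics model, the following minimal sketch fits a model by supervised learning. It assumes a linear model fitted by least squares in place of a neural network, and the "simulation data" is synthetic data generated inside the sketch; neither is taken from this disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "simulation data": transitions of a linear system
# x_{k+1} = A x_k + B u_k with a little noise (an assumption of this sketch).
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [0.5]])
X = rng.normal(size=(500, 2))  # state data x_k
U = rng.normal(size=(500, 1))  # control action data u_k
X_next = X @ A_true.T + U @ B_true.T + 0.01 * rng.normal(size=(500, 2))

# Supervised learning: the inputs are (x_k, u_k) and the labels are x_{k+1}.
inputs = np.hstack([X, U])
weights, *_ = np.linalg.lstsq(inputs, X_next, rcond=None)


def dynamics_model(x, u):
    """Predict x_{k+1} from x_k and u_k using the fitted weights."""
    return np.concatenate([x, u]) @ weights


# Recover the fitted system matrices for inspection.
A_hat, B_hat = weights[:2].T, weights[2:].T
```

Because the labels x_{k+1} come directly from the recorded transitions, no manual annotation is needed, which is why such training can also be regarded as self-supervised.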
[0033] An optimization method for calculating the policy π(x) that minimizes the cost function J_Q(x), as shown in equation (2) above, is disclosed in, for example, Reference 3 below. For details, please refer to Reference 3.
[0034] Reference 3: Amos, Brandon, et al. "Differentiable mpc for end-to-end planning and control." Advances in neural information processing systems 31 (2018).
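For orientation, one form of equations (1) and (2) that agrees with the definitions in this section is the quadratic cost used in Reference 3. The form below is an assumption based on that reference and on the roles of Q and p described here; the stacked vector \tau_k is notation introduced only for this sketch, and the exact equations of this disclosure may differ.

```latex
\begin{aligned}
J_Q(x) &= \sum_{k=0}^{N} \left( \tfrac{1}{2}\, \tau_k^{\top} Q\, \tau_k + p^{\top} \tau_k \right),
\qquad \tau_k = \begin{bmatrix} x_k \\ u_k \end{bmatrix},
\qquad x_{k+1} = f(x_k, u_k), \quad x_0 = x
&& \text{(1, assumed form)} \\
\pi(x) &= \operatorname*{arg\,min}_{u_0, \dots, u_N} \; J_Q(x)
&& \text{(2, assumed form)}
\end{aligned}
```

With x expressed as the deviation from the target value, the diagonal entries of Q penalize state error and control effort, while p adds a linear bias to each component.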
[0035] (C. Model-Free Reinforcement Learning) In this embodiment, an actor, which is a trained model generated by model-free reinforcement learning, is used to generate the control parameters Q and p of the cost function J_Q(x) of model predictive control. In this embodiment, an actor-critic algorithm, which is an example of model-free reinforcement learning, is used to train an actor and a critic. Note that the actor corresponds to the policy function in reinforcement learning, and the critic corresponds to the action value function in reinforcement learning.
[0036] The state data in the actor-critic algorithm is the same as the state data in the model predictive control described above. The input data of the actor is the state data, and the output data of the actor is the control parameters Q and p of the cost function J_Q(x) described above.
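The mapping from state data to the control parameters can be sketched as follows. This is a hypothetical single-layer actor with random weights, not the trained actor of this embodiment; the dimensions, the assumption that Q weights the stacked state and control action, and the use of softplus to keep the diagonal of Q positive are all choices made for illustration.

```python
import numpy as np

STATE_DIM = 4   # hypothetical size of the state data x
ACTION_DIM = 2  # hypothetical size of the control action data u
TAU_DIM = STATE_DIM + ACTION_DIM


def softplus(z):
    # Numerically stable softplus; keeps the diagonal of Q strictly positive.
    return np.logaddexp(0.0, z)


def actor(state, weights, bias):
    """Hypothetical single-layer actor: state data -> control parameters (Q, p)."""
    raw = weights @ state + bias      # 2 * TAU_DIM raw outputs
    q_diag = softplus(raw[:TAU_DIM])  # positive diagonal entries of Q
    p = raw[TAU_DIM:]                 # unconstrained vector p
    return np.diag(q_diag), p


rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(2 * TAU_DIM, STATE_DIM))
bias = np.zeros(2 * TAU_DIM)
Q, p = actor(rng.normal(size=STATE_DIM), weights, bias)
```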
[0037] The following equation (3) is the reward function in the actor-critic algorithm. As shown in equation (3), the reward function includes the difference between the force F_d represented by the target force data and the force F detected by the six-axis force sensor attached to the robot 12, and the difference between the position x_d of the pestle M calculated from the target trajectory data and the actual position x_m of the pestle M. Therefore, the smaller the difference between the force F_d represented by the target force data and the detected force F, the greater the reward obtained, and the smaller the difference between the position x_d of the pestle M calculated from the target trajectory data and the actual position x_m of the pestle M, the greater the reward obtained. Note that w_f and w_x in equation (3) are weight parameters that are set in advance.
[0038] (3)
[0039] Therefore, the reward function of the above equation (3) outputs a greater reward as the difference between the state data x of the robot 12 and the target state data of the robot becomes smaller, and outputs a greater reward as the difference between the force applied to the powder P by the operation of the robot 12 and the target force applied to the powder P becomes smaller.
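A reward with the properties described above can be sketched as follows. The negative weighted sum of Euclidean norms is an assumed form chosen for illustration, and the exact form of equation (3) may differ; the argument names stand for the target force, the detected force, the target pestle position, the actual pestle position, and the two preset weights.

```python
import numpy as np


def reward(F_d, F, x_d, x_m, w_f=1.0, w_x=1.0):
    """Negative weighted tracking error (assumed form); the largest value is 0."""
    force_error = np.linalg.norm(np.asarray(F_d) - np.asarray(F))
    position_error = np.linalg.norm(np.asarray(x_d) - np.asarray(x_m))
    return -(w_f * force_error + w_x * position_error)


# Perfect tracking of both force and position gives the maximum reward.
r_perfect = reward([0.0, 0.0, 5.0], [0.0, 0.0, 5.0], [0.1, 0.0, 0.0], [0.1, 0.0, 0.0])
# A force error lowers the reward.
r_force_off = reward([0.0, 0.0, 5.0], [0.0, 0.0, 3.0], [0.1, 0.0, 0.0], [0.1, 0.0, 0.0])
# A position error on top of the force error lowers it further.
r_both_off = reward([0.0, 0.0, 5.0], [0.0, 0.0, 3.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.0])
```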
[0040] <Control System 10> Figure 4 is a block diagram showing the schematic configuration of the control system 10 of this embodiment. As shown in Figure 4, the control system 10 comprises a sensor group 11, a robot 12, and a control device 14. The control device 14 generates a trained actor and a trained dynamics model in model predictive control, which are trained models for controlling the operation of the end effector 16 of the robot 12. The control device 14 also uses the generated trained models to control the operation of the end effector 16 of the robot 12.
[0041] The sensor group 11 is attached to the end effector 16 of the robot 12 and sequentially measures observation data, including the state data described above. The sensor group 11 measures the position of the pestle M, the velocity of the pestle M, the jerk / acceleration of the pestle M, the history of the robot 12's actions, and the history of the forces and torques applied to the robot 12 at n points. For example, the forces and torques applied to the robot 12 at n points are six-axis force sensor values (e.g., 3-axis force and 3-axis torque). The sensor group 11 then outputs the obtained observation data to the control device 14.
[0042] Figure 5 is a block diagram showing the hardware configuration of the control device 14 according to this embodiment. As shown in Figure 5, the control device 14 includes a CPU (Central Processing Unit) 42, memory 44, storage device 46, input/output interface (I/F) 48, storage medium reader 50, and communication interface (I/F) 52. The components are connected via a bus 54 so that they can communicate with each other.
[0043] The storage device 46 stores a trained model generation program and a control program for executing the processes described later. The CPU 42 is a central processing unit that executes various programs and controls each component. Specifically, the CPU 42 reads a program from the storage device 46 and executes the program using memory 44 as a workspace. The CPU 42 controls each component and performs various calculations according to the program stored in the storage device 46.
[0044] Memory 44 is composed of RAM (Random Access Memory) and temporarily stores programs and data as a working area. Storage device 46 is composed of ROM (Read Only Memory), HDD (Hard Disk Drive), SSD (Solid State Drive), etc., and stores various programs including the operating system and various data.
[0045] The input/output I/F 48 is an interface for inputting data from and outputting data to external devices. It may also be connected to various input devices, such as keyboards and mice, and to output devices, such as displays and printers, for outputting various types of information. A touch panel display may be used so that the output device also functions as an input device.
[0046] The storage medium reader 50 reads data stored on various storage media such as CD (Compact Disc)-ROM, DVD (Digital Versatile Disc)-ROM, Blu-ray disc, and USB (Universal Serial Bus) memory, and writes data to the storage media.
[0047] The communication I/F 52 is an interface for communicating with other devices, and standards such as Ethernet®, FDDI, and Wi-Fi® are used.
[0048] Next, the functional configuration of the control device 14 will be described. As shown in Figure 4, the control device 14 functionally includes a simulation unit 20, a learning unit 21, an acquisition unit 22, a parameter acquisition unit 23, a setting unit 24, an action acquisition unit 25, and a control unit 26. In addition, a data storage unit 28, a trained model storage unit 30, and a controller storage unit 32 are provided in a predetermined storage area of the control device 14. Each functional component is realized by the CPU 42 reading the corresponding program stored in the storage device 46, loading it into the memory 44, and executing it.
[0049] The data storage unit 28 stores observation data detected by the sensor group 11. The data storage unit 28 also stores control data generated when the robot 12's end effector 16 operates. For example, it stores a history of operation data generated when the robot 12's end effector 16 operates.
[0050] The trained model storage unit 30 stores the trained actor and the trained critic generated by the process described later, and a control model used in model predictive control. The control model is a model that realizes equations (1) and (2) above, and includes a trained dynamics model.
[0051] The controller storage unit 32 stores information representing the FDCC described above. Specifically, by substituting the FDCC parameters generated in model predictive control into the FDCC stored in the controller storage unit 32, the joint positions of the robot 12 are output.
[0052] First, the simulation unit 20 and the learning unit 21 generate a trained model for controlling the operation of the robot 12's end effector 16, which includes a trained actor, a trained critic, and a trained dynamics model used in model predictive control.
[0053] The simulation unit 20 performs a simulation in which a virtual robot pulverizes a virtual powder.
[0054] The learning unit 21 trains the actor and critic of the actor-critic algorithm using known model-free reinforcement learning, based on the data obtained while the simulation unit 20 is running the simulation. The learning unit 21 also trains the dynamics model using known supervised learning or self-supervised learning, based on the data obtained while the simulation unit 20 is running the simulation.
[0055] Specifically, the learning unit 21 generates a trained actor and a trained critic by training the actor and critic according to the actor-critic algorithm so that the reward r calculated by equation (3) above becomes large. The learning unit 21 then stores the trained actor and trained critic in the trained model storage unit 30. The reward r can be set in advance depending on the type of task that the robot 12 performs.
[0056] Furthermore, the learning unit 21 trains the dynamics model using known supervised learning or self-supervised learning. Since pairs of input and output data for the dynamics model can be obtained when running the simulation, the dynamics model can be trained in this way. Therefore, the trained actor and the control model that includes the trained dynamics model are models that have been trained based on data obtained by simulating a virtual space that mimics the real space.
[0057] When the trained actor, trained critic, and trained dynamics model are stored in the trained model storage unit 30, it becomes possible to control the operation of the robot 12's end effector 16 using these trained models. Therefore, the acquisition unit 22 and the control unit 26 control the operation of the robot 12's end effector 16 using the trained models stored in the trained model storage unit 30 and the FDCC stored in the controller storage unit 32. The trained actor is an example of the first trained model in this disclosure. The control model used in model predictive control is an example of the second trained model in this disclosure.
[0058] The acquisition unit 22 acquires the current state data of the robot 12.
[0059] The parameter acquisition unit 23 reads the trained actor stored in the trained model storage unit 30. Then, the parameter acquisition unit 23 inputs the current state data acquired by the acquisition unit 22 into the trained actor, thereby acquiring control parameters [Q, p] corresponding to the current state data.
[0060] The setting unit 24 reads out the control model, which includes the trained dynamics model, stored in the trained model storage unit 30. The setting unit 24 then sets the control parameters [Q, p] obtained by the parameter acquisition unit 23 in the control model. Specifically, the setting unit 24 substitutes the control parameters [Q, p] obtained by the parameter acquisition unit 23 into the Q and p terms of equation (1) above.
[0061] The action acquisition unit 25 obtains a predicted value of the state data for the next time step by inputting the current state data and candidate control action data to the dynamics model of the control model in which the control parameters [Q, p] have been set. The action acquisition unit 25 then acquires, as formal control action data, the candidate control action data that minimizes the difference between the predicted value and the target value of the state data for the next time step. Specifically, the action acquisition unit 25 determines, according to equation (2) above, a policy π that minimizes the cost function J_Q(x). The policy π corresponds to a sequence of formal control action data. Note that the candidate control action data is an example of the action candidate data in this disclosure, and the control action data is an example of the action data in this disclosure.
[0062] The control unit 26 controls the movement of the robot 12 so that the movement represented by the sequence of formal control action data acquired by the action acquisition unit 25 is realized. The sequence of formal control action data is the FDCC parameters described above. Therefore, the control unit 26 inputs the FDCC parameters, which are the sequence of formal control action data, into the FDCC stored in the controller storage unit 32, and acquires the joint positions of the robot 12 output from the FDCC.
[0063] The control unit 26 then controls the robot 12 based on the joint positions of the robot 12 output from the FDCC. Specifically, the control unit 26 outputs control signals to the robot 12 such that the joint positions of the robot 12 output from the FDCC are realized.
[0064] Next, the operation of the control system 10 according to this embodiment will be described.
[0065] When the control device 14 receives a predetermined instruction signal, the CPU 42 of the control device 14 reads the trained model generation program from the storage device 46, loads it into the memory 44, and executes it. As a result, the CPU 42 functions as each of the functional components of the control device 14, and the trained model generation process shown in Figure 6 is executed.
[0066] In step S100, the simulation unit 20 performs a simulation in which a virtual robot pulverizes a virtual powder.
[0067] In step S102, the learning unit 21 uses known model-free reinforcement learning to train the actor and critic in the actor-critic algorithm based on the data obtained while the simulation in step S100 is being executed.
[0068] In step S104, the learning unit 21 trains a dynamics model using known supervised learning or self-supervised learning based on the data obtained while the simulation in step S100 is being executed.
[0069] In step S106, the learning unit 21 stores the trained actor, the trained critic, and the trained dynamics model in the trained model storage unit 30.
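As a non-limiting illustration of step S104, the following Python sketch fits a dynamics model by supervised learning on transition data. It assumes a linear model fitted by least squares and a synthetic linear "simulator" with hypothetical matrices `A_true` and `B_true`. The embodiment does not specify the model class, and a neural network could equally be used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulator: x_{t+1} = A x_t + B u_t plus small noise.
A_true = np.array([[1.0, 0.1], [0.0, 0.9]])
B_true = np.array([[0.0], [0.5]])

X = rng.normal(size=(500, 2))   # state data at time t
U = rng.normal(size=(500, 1))   # action data at time t
X_next = X @ A_true.T + U @ B_true.T + 1e-3 * rng.normal(size=(500, 2))

# Supervised learning: regress the next state on [state, action].
inputs = np.hstack([X, U])                            # shape (500, 3)
W, *_ = np.linalg.lstsq(inputs, X_next, rcond=None)   # shape (3, 2)

def dynamics_model(x, u):
    """Predict the state at time t+1 from the state and action at time t."""
    return np.concatenate([x, u]) @ W

pred = dynamics_model(np.array([1.0, 2.0]), np.array([0.5]))
print(pred)
```

The fitted `dynamics_model` plays the role of the trained dynamics model that is stored in step S106 and later used for prediction.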
[0070] Next, when the control device 14 receives a predetermined instruction signal, the control device 14 executes the control process shown in Figure 7. The control process in Figure 7 is executed repeatedly.
[0071] In step S200, the acquisition unit 22 acquires the current state data of the robot.
[0072] In step S202, the parameter acquisition unit 23 reads the trained actor stored in the trained model storage unit 30. Then, the parameter acquisition unit 23 inputs the current state data acquired in step S200 into the trained actor to acquire control parameters [Q, p] corresponding to the current state data.
[0073] In step S204, the setting unit 24 reads out a control model that includes a trained dynamics model stored in the trained model storage unit 30. The setting unit 24 then sets the control parameters [Q, p] obtained in step S202 to the control model.
[0074] In step S206, the action acquisition unit 25 acquires FDCC parameters, which are formal control action data, using a control model in which control parameters [Q, p] have been set.
[0075] In step S208, the control unit 26 inputs the FDCC parameters obtained in step S206 to the FDCC stored in the controller memory unit 32, thereby acquiring the joint positions of the robot 12 output from the FDCC.
[0076] In step S210, the control unit 26 controls the robot 12 based on the joint positions of the robot 12 output from the FDCC.
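As a non-limiting illustration, one iteration of steps S200 to S210 can be sketched in Python as follows. The function `control_step` and all of its callable arguments are hypothetical placeholders for the corresponding units. The stubs passed in below return fixed values only so that the sketch runs.

```python
import numpy as np

def control_step(read_state, actor, plan, fdcc, send_joint_command):
    """Run one iteration of the control process (S200 to S210)."""
    x = read_state()              # S200: acquire current state data
    Q, p = actor(x)               # S202: trained actor outputs [Q, p]
    fdcc_params = plan(x, Q, p)   # S204 + S206: set [Q, p], obtain FDCC parameters
    joints = fdcc(fdcc_params)    # S208: FDCC outputs joint positions
    send_joint_command(joints)    # S210: control the robot
    return joints

# Hypothetical stubs standing in for the robot, the trained models, and the FDCC.
sent = []
joints = control_step(
    read_state=lambda: np.zeros(3),
    actor=lambda x: (np.eye(3), np.zeros(3)),
    plan=lambda x, Q, p: np.array([0.1, 0.2]),
    fdcc=lambda params: np.concatenate([params, params, params]),
    send_joint_command=sent.append,
)
print(joints)
```

Because the control process is executed repeatedly, `control_step` would be called once per control cycle, so that [Q, p] is refreshed from the latest state data each time.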
[0077] As described above, the control device according to this embodiment acquires the current state data of the robot. The control device acquires control parameters corresponding to the current state data by inputting the current state data to a trained actor, which is a first trained model trained in advance by reinforcement learning to output control parameters for time t when state data for time t is input. The control device sets the control parameters in a control model, which is a second trained model trained in advance to output predicted values of the state data for time t+1 when state data for time t and action candidate data for time t are input. The control device acquires a predicted value of the state data for the next time by inputting the current state data and action candidate data to the dynamics model of the control model in which the control parameters have been set, and acquires, as control action data, the control action candidate data that minimizes the difference between the predicted value and the target value of the state data for the next time. The control device controls the robot's movement so that the action represented by the control action data is realized. This makes it possible to control the robot in accordance with changes in the state of an object when that state changes as a result of the robot applying force to the object. Furthermore, when the state of an object changes due to the robot's actions, it becomes possible to respond to both short-term and long-term changes in the object.
[0078] Next, an example will be described. In this example, a simulation was performed to verify the effectiveness of the proposed method. The simulated task was one in which a robot grasps a pestle and grinds powder in a mortar.
[0079] Figure 8 shows the simulation results. The left side of Figure 8 shows the results when using only the actor-critic algorithm, the center shows the results when using only model predictive control, and the right side shows the results when using the proposed method, which uses both the actor-critic algorithm and model predictive control. In the figure, "ref traj" represents the target trajectory, and "eef pos" represents the actual trajectory of the pestle moved by the robot. The upper part of Figure 8 shows the target trajectory and the actual trajectory along the x-axis ("x axis" in the figure), the y-axis ("y axis" in the figure), and the z-axis ("z axis" in the figure). The lower part of Figure 8 is a view of the mortar from above, showing the target trajectory and the actual trajectory in the x-y plane. As can be seen from the lower part of Figure 8, the target trajectory is circular. Figure 8 shows that, of the three methods, the proposed method of this embodiment produces the trajectory closest to the target trajectory.
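As a non-limiting illustration, the closeness between "ref traj" and "eef pos" could be quantified by a root-mean-square tracking error, as in the following Python sketch. This disclosure does not state which metric, if any, was used. The function `tracking_rmse`, the radius, and the offset below are hypothetical, and the data are synthetic rather than the results of Figure 8.

```python
import numpy as np

def tracking_rmse(ref_traj, eef_pos):
    """Root-mean-square Euclidean distance between two sampled trajectories."""
    diff = np.asarray(eef_pos) - np.asarray(ref_traj)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

# Synthetic circular target trajectory in the x-y plane (radius is an assumption).
t = np.linspace(0.0, 2.0 * np.pi, 200)
radius = 0.02
ref = np.stack([radius * np.cos(t), radius * np.sin(t)], axis=1)

# Synthetic "actual" trajectory: the target shifted by a constant 1 mm in x.
eef = ref + np.array([0.001, 0.0])

rmse = tracking_rmse(ref, eef)
print(rmse)
```

A smaller value indicates a trajectory closer to the target. Computing the same quantity for each of the three methods would give one number per panel of the figure.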
[0080] This disclosure is not limited to the embodiments and examples described above, and various modifications and applications are possible without departing from the gist of this disclosure.
[0081] For example, in the above embodiment, the state of an object changes in response to the robot's movement, specifically, the state of the object changes due to the robot applying force to the object. As an example of this, the robot's action of crushing powder was described, but it is not limited to this. The above embodiment can be applied to the control of any action in which the state of an object changes in response to the robot's movement. For example, the object could be a viscous fluid or a flexible object (e.g., udon dough), and the above embodiment can also be applied to the control of an action in which the robot applies force to such a viscous fluid or flexible object.
[0082] Furthermore, although the above embodiment describes an example in which the control device 14 executes both the trained model generation process in Figure 6 and the control process in Figure 7, it is not limited to this. For example, a trained model generation device implemented by a computer separate from the control device 14 may be provided, and the trained model generation device may execute the trained model generation process in Figure 6, while the control device 14 executes the control process in Figure 7. In this case, the trained model generation device includes at least the simulation unit 20 and the learning unit 21.
[0083] Furthermore, the trained actor, which is the first trained model, and the control model, which is the second trained model, are models trained based on data obtained by simulating a virtual space that mimics real space. This simulation may be performed while randomly changing the physical parameters.
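As a non-limiting illustration, randomly changing the physical parameters of the simulation can be sketched in Python as follows. The parameter names (`friction`, `particle_radius_m`, `stiffness`) and their ranges are hypothetical, since this disclosure does not specify which physical parameters are varied.

```python
import random
from dataclasses import dataclass

@dataclass
class PhysicalParams:
    """Hypothetical physical parameters of the simulated powder."""
    friction: float
    particle_radius_m: float
    stiffness: float

def sample_params(rng):
    """Draw one set of physical parameters uniformly from assumed ranges."""
    return PhysicalParams(
        friction=rng.uniform(0.2, 0.8),
        particle_radius_m=rng.uniform(0.5e-3, 2.0e-3),
        stiffness=rng.uniform(1e3, 1e4),
    )

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# One parameter set per simulated episode.
episodes = [sample_params(rng) for _ in range(5)]
for params in episodes:
    print(params)
```

Each sampled parameter set would be applied to the simulator before an episode is run, so that the models are trained on data covering a range of powder behavior.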
[0084] Alternatively, the Q-learning algorithm can be used instead of the actor-critic algorithm described above.
[0085] Furthermore, in the above embodiment, the example of generating each trained model was described using simulation data obtained in a simulation, but the invention is not limited to this. Each trained model may also be generated based on data obtained by operating the robot in real space.
[0086] Furthermore, in the above embodiment, each process that the CPU executes by reading software (a program) may instead be executed by various processors other than the CPU. Examples of such processors include PLDs (Programmable Logic Devices) whose circuit configuration can be changed after manufacturing, such as FPGAs (Field-Programmable Gate Arrays), and dedicated electrical circuits that have a circuit configuration specifically designed to execute a particular process, such as ASICs (Application Specific Integrated Circuits). Each process may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, multiple FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electrical circuit that combines circuit elements such as semiconductor elements.
[0087] Furthermore, although the above embodiment describes a configuration in which each program is pre-stored (installed) in a storage device, the invention is not limited to this. The program may be provided in a form stored on a storage medium such as a CD-ROM, DVD-ROM, Blu-ray disc, or USB memory. Alternatively, the program may be provided in a form that is downloaded from an external device via a network.
[0088] (Notes) The following notes are provided regarding aspects of this disclosure.
[0089] (Note 1) A control device for controlling the movement of a robot, comprising: an acquisition unit for acquiring the current state data of the robot; a parameter acquisition unit for acquiring the control parameters corresponding to the current state data by inputting the current state data to a first trained model that has been trained in advance by reinforcement learning to output control parameters for time t when state data for time t is input; a setting unit for setting the control parameters to a second trained model that has been trained in advance to output predicted values of state data for time t+1 when state data for time t and action candidate data for time t are input; an action acquisition unit for acquiring predicted values of state data for the next time by inputting the current state data and action candidate data to the second trained model on which the control parameters have been set, and acquiring action candidate data as action data such that the difference between the predicted value and the target value of the state data for the next time becomes small; and a control unit for controlling the movement of the robot so that the action represented by the action data is realized.
(Note 2) The control device according to Note 1, wherein the movement of the robot is a movement in which the robot applies force to an object and the state of the object changes in accordance with that movement.
(Note 3) The control device according to Note 1 or Note 2, wherein the first trained model and the second trained model are models trained based on data obtained by simulating a virtual space that mimics real space, and the simulation is performed by randomly changing physical parameters.
(Note 4) The control device according to any one of Notes 1 to 3, wherein the first trained model is a model trained in advance by model-free reinforcement learning.
(Note 5) The control device according to any one of Notes 1 to 4, wherein the first trained model is a model trained in advance by the actor-critic algorithm.
(Note 6) The control device according to any one of Notes 1 to 5, wherein the second trained model is a control model in model predictive control and is configured to include a dynamics model trained in advance by supervised learning or self-supervised learning.
(Note 7) The control device according to Note 4, wherein the robot's operation is such that the state of the object changes in accordance with the operation, and the reward function used to train the first trained model by model-free reinforcement learning is such that a larger reward is output the smaller the difference between the robot's state data and the robot's target state data, and a larger reward is output the smaller the difference between the force applied to the object by the operation and the target force applied to the object.
(Note 8) A control method for controlling the movement of a robot, the method comprising: acquiring the current state data of the robot; acquiring the control parameters corresponding to the current state data by inputting the current state data to a first trained model that has been trained in advance by reinforcement learning to output control parameters for time t when state data for time t is input; setting the control parameters to a second trained model that has been trained in advance to output predicted values of state data for time t+1 when state data for time t and action candidate data for time t are input; acquiring predicted values of state data for the next time by inputting the current state data and action candidate data to the second trained model with the control parameters set; acquiring action candidate data as action data such that the difference between the predicted value and the target value of the state data for the next time is small; and controlling the movement of the robot so that the action represented by the action data is realized, the method being performed by a computer.
(Note 9) A control program for controlling the movement of a robot, which causes a computer to execute a process that includes: acquiring the current state data of the robot; acquiring the control parameters corresponding to the current state data by inputting the current state data to a first trained model that has been trained in advance by reinforcement learning to output control parameters for time t when state data for time t is input; setting the control parameters to a second trained model that has been trained in advance to output predicted values of state data for time t+1 when state data for time t and action candidate data for time t are input; acquiring predicted values of state data for the next time by inputting the current state data and action candidate data to the second trained model with the control parameters set; acquiring action candidate data as action data such that the difference between the predicted value and the target value of the state data for the next time is small; and controlling the movement of the robot so that the action represented by the action data is realized.
[0090] The disclosure of Japanese Patent Application No. 2024-213846, filed on 6 December 2024, is incorporated herein by reference in its entirety. All documents, patent applications, and technical standards described herein are incorporated herein by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
Claims
1. A control device for controlling the movement of a robot, comprising: an acquisition unit for acquiring the current state data of the robot; a parameter acquisition unit for acquiring the control parameters corresponding to the current state data by inputting the current state data to a first trained model that has been trained in advance by reinforcement learning to output control parameters for time t when state data for time t is input; a setting unit for setting the control parameters to a second trained model that has been trained in advance to output predicted values of state data for time t+1 when state data for time t and action candidate data for time t are input; an action acquisition unit for acquiring predicted values of state data for the next time by inputting the current state data and action candidate data to the second trained model on which the control parameters have been set, and acquiring action candidate data as action data such that the difference between the predicted value and the target value of the state data for the next time becomes small; and a control unit for controlling the movement of the robot so that the action represented by the action data is realized.
2. The control device according to claim 1, wherein the robot's operation is such that the state of the object changes in response to the robot's operation of applying force to the object.
3. The control device according to claim 1 or claim 2, wherein the first trained model and the second trained model are models trained based on data obtained by simulating a virtual space that mimics a real space, and the simulation is performed by randomly changing physical parameters.
4. The control device according to claim 1 or claim 2, wherein the first trained model is a model trained in advance by model-free reinforcement learning.
5. The control device according to claim 1 or claim 2, wherein the first trained model is a model trained in advance by an actor-critic algorithm.
6. The control device according to claim 1 or claim 2, wherein the second trained model is a control model in model predictive control and comprises a dynamics model trained in advance by supervised learning or self-supervised learning.
7. The control device according to claim 4, wherein the robot's operation is such that the state of the object changes in accordance with the operation, and the reward function used to train the first trained model by model-free reinforcement learning is such that a larger reward is output the smaller the difference between the robot's state data and the robot's target state data, and a larger reward is output the smaller the difference between the force applied to the object by the operation and the target force applied to the object.
8. A control method for controlling the movement of a robot, the method comprising: acquiring the current state data of the robot; acquiring the control parameters corresponding to the current state data by inputting the current state data to a first trained model that has been trained in advance by reinforcement learning to output control parameters for time t when state data for time t is input; setting the control parameters to a second trained model that has been trained in advance to output predicted values of state data for time t+1 when state data for time t and action candidate data for time t are input; acquiring predicted values of state data for the next time by inputting the current state data and action candidate data to the second trained model with the control parameters set; acquiring action candidate data as action data such that the difference between the predicted value and the target value of the state data for the next time is small; and controlling the movement of the robot so that the action represented by the action data is realized, the method being performed by a computer.
9. A control program for controlling the movement of a robot, the control program causing a computer to execute a process comprising: acquiring the current state data of the robot; acquiring the control parameters corresponding to the current state data by inputting the current state data to a first trained model that has been trained in advance by reinforcement learning to output control parameters for time t when state data for time t is input; setting the control parameters to a second trained model that has been trained in advance to output predicted values of state data for time t+1 when state data for time t and action candidate data for time t are input; acquiring predicted values of state data for the next time by inputting the current state data and action candidate data to the second trained model with the control parameters set; acquiring action candidate data as action data such that the difference between the predicted value and the target value of the state data for the next time is small; and controlling the movement of the robot so that the action represented by the action data is realized.