Device and method for training a control policy for a robot device
By training a neural network to imitate a reference vector field using Riemannian manifolds and Stochastic La Salle's Invariance Principle, the method addresses inefficiencies in existing robot control policy training, ensuring stable and efficient operation in dynamic environments.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ROBERT BOSCH GMBH
- Filing Date
- 2025-11-28
- Publication Date
- 2026-06-18
AI Technical Summary
Existing methods for training robot control policies from demonstrations lack stability guarantees and are inefficient, particularly when dealing with dynamic and unstructured environments.
A method is provided that trains a neural network to imitate a reference vector field, which is the negative gradient of an energy function, ensuring stability by leveraging Riemannian manifolds and incorporating the Stochastic La Salle's Invariance Principle, thereby providing faster inference and easier training.
This approach ensures stable control policies in both seen and unseen situations, enhancing the robot's ability to navigate and interact in dynamic environments with improved robustness and efficiency.
Abstract
Description
[0001] R. 416143
[0003] Description
[0004] Title
[0005] Device and Method for Training a Control Policy for a Robot Device
[0006] The present disclosure relates to devices and methods for training a control policy for a robot device.
[0007] In many applications, it is desirable that robots can perform autonomously in possibly dynamic and unstructured environments. For this, they need to learn how to move in and interact with their surroundings. To do so, robots may rely on a library of skills that can be used to execute simple motions or perform complicated tasks as a composition of several skills. A way to learn motion skills is via human examples, known as learning from demonstrations (LfD). This entails a (typically human) expert showing once or several times a specific skill (e.g. a motion) to be imitated by a robot. Learning a robot control policy from demonstration allows robots to directly acquire human-like skills to perform a large diversity of tasks. Efficient approaches for learning from demonstrations are desirable.
[0008] According to various embodiments, a method for training a control policy for a robot device is provided, comprising providing demonstrations, each demonstration indicating, for one or more observations, a sequence of actions of the robot device indicating how the robot device should react to the one or more observations, and training a neural network to imitate, for each demonstration of the demonstrations, a reference vector field, wherein the reference vector field is the negative gradient of an energy function and following the reference vector field maps points representing action sequences to a point representing the action sequence indicated by the demonstration.
[0010] In comparison to control policy training approaches such as diffusion policies and consistency-based policies, the method above (specifically in its SRFMP (stable Riemannian flow matching policy) embodiment as described below) provides faster inference and easier training. It is also endowed with stability guarantees, which keep the (e.g. RFMP) inference stable at a target action regardless of the inference horizon.
[0011] In the following, various examples are given.
[0012] Example 1 is a method for training a control policy as described above.
[0013] Example 2 is the method of example 1 , wherein each of the points representing the action sequences represents a respective action sequence and an initial value (e.g. 0) of a scalar (of an additional dimension with respect to the action dimensions) and the point representing the action sequence indicated by the demonstration represents the action sequence indicated by the demonstration and a target value (e.g. 1) of the scalar.
[0014] The scalar can be used as time variable or phase variable such that the vector field direction can depend on a time variable while still being a stationary vector field (since the time variable is included in a dimension on the space the vector field is defined on rather than the vector field varying over time). This allows applying Stochastic La Salle’s Invariance Principle to guarantee stability.
[0015] Example 3 is a method of example 1 or 2, wherein the energy function goes to a minimum (e.g. zero) when the demonstration is reached.
[0016] By using the gradient of an energy function which reaches a minimum at the demonstration, it is ensured that the vector field that the neural network is trained to imitate stably reaches the demonstrated action sequences. This behaviour is then inherited for situations (i.e. observations) unseen in training, such that stable control policies can be achieved in unknown situations as well.
[0018] Example 4 is a method of any one of examples 1 to 3, wherein the reference vector field is a vector field on a Riemannian manifold and applying the energy function to a point of the Riemannian manifold comprises mapping the point to a tangent space of the Riemannian manifold (e.g. at the point representing the action sequence indicated by the demonstration).
[0019] This allows handling non-Euclidean action spaces.
[0020] Example 5 is a method of any one of examples 1 to 4, wherein the energy function is a quadratic form (applied to data points of the space on which the vector fields are defined or applied to tangent space points, i.e. it may be a concatenation of a logarithmic map with a quadratic form).
[0021] This allows easy determination of the gradient.
[0022] Example 6 is a method for controlling a robot device in a control situation, comprising training a control policy according to any one of examples 1 to 5, determining (e.g. from sensor data) an observation comprising information about the control situation, supplying the observation and an initial point to the trained neural network, determining an action sequence by integrating the initial point along the vector field represented by the neural network for the observation and controlling the robot device by at least partially executing the determined action sequence.
[0023] Example 7 is a data processing system (e.g. in particular a robot device controller), configured to perform a method of any one of examples 1 to 6.
[0024] Example 8 is a computer program comprising instructions which, when executed by a computer, make the computer perform a method according to any one of examples 1 to 6.
[0026] Example 9 is a computer-readable medium comprising instructions which, when executed by a computer, make the computer perform a method according to any one of examples 1 to 6.
[0027] In the drawings, similar reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various aspects are described with reference to the following drawings, in which:
[0028] Figure 1 shows a robot.
[0029] Figure 2 illustrates RFMP (Riemannian Flow Matching Policy).
[0030] Figure 3 shows a flow diagram illustrating a method for training a control policy for a robot device according to an embodiment.
[0031] The following detailed description refers to the accompanying drawings that show, by way of illustration, specific details and aspects of this disclosure in which the invention may be practiced. Other aspects may be utilized, and structural, logical, and electrical changes may be made without departing from the scope of the invention. The various aspects of this disclosure are not necessarily mutually exclusive, as some aspects of this disclosure can be combined with one or more other aspects of this disclosure to form new aspects.
[0032] In the following, various examples will be described in more detail.
[0033] Figure 1 shows a robot 100.
[0034] The robot 100 includes a robot arm 101, for example an industrial robot arm for handling or assembling a work piece (or one or more other objects) 113. The robot arm 101 includes manipulators 102, 103, 104 and a base (or support) 105 by which the manipulators 102, 103, 104 are supported.
[0035] The term “manipulator” refers to the movable members of the robot arm 101, the actuation of which enables physical interaction with the environment, e.g. to carry out a task. For control, the robot 100 includes a (robot) controller 106 configured to implement the interaction with the environment according to a control program. The last member 104 (furthest from the support 105) of the manipulators 102, 103, 104 is also referred to as the end-effector 104 and may include one or more tools such as a welding torch, gripping instrument, painting equipment, or the like.
[0036] The other manipulators 102, 103 (closer to the support 105) may form a positioning device such that, together with the end-effector 104, the robot arm 101 with the end-effector 104 at its end is provided. The robot arm 101 is a mechanical arm that can provide similar functions as a human arm (possibly with a tool at its end).
[0037] The robot arm 101 may include joint elements 107, 108, 109 interconnecting the manipulators 102, 103, 104 with each other and with the support 105. A joint element 107, 108, 109 may have one or more joints, each of which may provide rotatable motion (i.e. rotational motion) and / or translatory motion (i.e. displacement) to associated manipulators relative to each other. The movement of the manipulators 102, 103, 104 may be initiated by means of actuators controlled by the controller 106.
[0038] The term "actuator" may be understood as a component adapted to affect a mechanism or process in response to being driven. The actuator can implement instructions issued by the controller 106 (the so-called activation) into mechanical movements. The actuator, e.g. an electromechanical converter, may be configured to convert electrical energy into mechanical energy in response to driving.
[0039] The term "controller" may be understood as any type of logic implementing entity, which may include, for example, a circuit and / or a processor capable of executing software stored in a storage medium, firmware, or a combination thereof, and which can issue instructions, e.g. to an actuator in the present example.
[0040] The controller may be configured, for example, by program code (e.g., software) to control the operation of a system, a robot in the present example.
[0041] In the present example, the controller 106 includes one or more processors 110 and a memory 111 storing code and data based on which the processor 110 controls the robot arm 101. According to various embodiments, the controller 106 controls the robot arm 101 using a machine learning model 112 stored in the memory 111.
[0042] The machine learning model 112 is trained by learning from demonstrations, e.g. provided by a user by moving the robot arm 101 by hand or via teleoperation. Each demonstration is a sequence of robot configurations. According to various embodiments, each configuration is at least partially given by a point on a Riemannian manifold. For example, a certain demonstration trajectory $\mathcal{T} = \{y_t\}_{t=1}^{T}$ is composed of datapoints $y_t \in \mathbb{R}^3 \times \mathcal{S}^3$, i.e. of pairs of points, wherein the first point (of $\mathbb{R}^3$) for example specifies the end-effector position and the second point is a quaternion (of $\mathcal{S}^3$) and specifies the orientation of the end-effector. Quaternions have favourable properties for robot control. However, since quaternions (used for robot control) satisfy a unit-norm constraint, they do not form a vector space. They do, however, lie on the Riemannian manifold $\mathcal{S}^3$. Another example for a Riemannian manifold whose elements (i.e. points) may be used for robot control is the special orthogonal group (SO(3)).
[0043] A Riemannian manifold $\mathcal{M}$ is an m-dimensional topological space, for which each point locally resembles a Euclidean space $\mathbb{R}^m$, and which has a globally defined differential structure. For each point $x \in \mathcal{M}$, there exists a tangent space $\mathcal{T}_x\mathcal{M}$ that is a vector space consisting of the tangent vectors of all the possible smooth curves passing through x. A Riemannian manifold is equipped with a smoothly varying positive definite inner product called a Riemannian metric, which permits defining curve lengths in $\mathcal{M}$. The minimum-length curves between two points in $\mathcal{M}$, called geodesics, are the generalization of straight lines on the Euclidean space to Riemannian manifolds.
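For illustration, the tangent-space and geodesic notions above can be made concrete on the unit sphere, which is where unit quaternions live. The following is a minimal sketch, not code from the disclosure: the function names `sphere_exp` and `sphere_log` are hypothetical, and the closed-form expressions are the standard exponential and logarithmic maps of the unit sphere with its usual round metric.

```python
import numpy as np

def sphere_exp(x, v, eps=1e-12):
    """Exponential map: start at unit vector x and follow tangent vector v."""
    norm_v = np.linalg.norm(v)
    if norm_v < eps:
        return x.copy()
    return np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v

def sphere_log(x, y, eps=1e-12):
    """Logarithmic map: tangent vector at x whose geodesic reaches y.

    Returns the zero vector when y coincides with x (and also for the
    antipodal point, where the map is not uniquely defined).
    """
    cos_theta = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(cos_theta)        # geodesic distance between x and y
    w = y - cos_theta * x               # part of y orthogonal to x
    norm_w = np.linalg.norm(w)
    if norm_w < eps:
        return np.zeros_like(x)
    return theta * w / norm_w

q0 = np.array([1.0, 0.0, 0.0, 0.0])     # identity quaternion
q1 = np.array([0.0, 1.0, 0.0, 0.0])     # rotation by 180 degrees about x
v = sphere_log(q0, q1)                  # tangent vector at q0 pointing to q1
q_back = sphere_exp(q0, v)              # following it recovers q1
```

The length of `v` is the geodesic distance between the two quaternions, and `v` is orthogonal to `q0`, i.e. it lies in the tangent space at `q0`.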
[0045] Approaches for robot policy learning from demonstrations include usage of normalizing flows, diffusion models and flow matching (FM). In contrast to normalizing flows and diffusion models, FM trains a normalizing flow by regressing a conditional vector field, thus avoiding the complex training procedure of classical normalizing flow methods and builds simpler probability transfer paths than diffusion models, thus achieving faster inference.
[0046] Specifically, RFMP (Riemannian Flow Matching Policy) can generate trajectories on Euclidean space and Riemannian manifolds with higher quality than DP (Diffusion Policy), i.e. control policy learning based on diffusion models.
[0047] According to various embodiments, the extension of FM to Riemannian manifolds (RFM) is leveraged to ensure that the geometry constraints naturally associated with robot configurations are satisfied. Such constraints notably include the unit norm constraint of quaternions, which are used to represent end-effector rotations.
[0048] Specifically, according to various embodiments, a stable Riemannian flow matching policy (SRFMP) is provided to enhance the stability and thus the robustness of Riemannian Flow Matching Policies (RFMP). This can be seen to extend stable FM to Riemannian manifolds. According to various embodiments, SRFMP builds on stable (autonomous) flow matching (SFM), which leverages La Salle's invariance principle to endow the FM model with stability guarantees with respect to the data samples (e.g. demonstrations), thus ensuring that it respects their inherent physical stability. Nevertheless, SFM is limited to Euclidean space, while SRFMP extends SFM to Riemannian manifolds and integrates it into the RFMP framework.
[0049] According to various embodiments, as described in the following, a control policy training approach is provided which includes an extension of RFMP to stable visuomotor robot policies by leveraging stable control theory and faster visuomotor policy inference via an automatic and single ODE (ordinary differential equation) step.
[0051] According to various embodiments, the RFMP framework is leveraged to learn complex visuomotor control policies in real-world robotic tasks. The robustness of RFMP is enhanced by introducing a Stable Riemannian Flow Matching Policy (SRFMP) for which SFM is extended to Riemannian manifolds and integrated into the RFMP framework. A single-step ODE solver, based on the Euler method, which leverages the SRFMP framework is for example used to resolve the inference process in a single time step.
[0052] As mentioned above, a control policy is learned from demonstrations, where, for example, a user guides the robot arm manually to perform a certain task in various control situations (e.g. for different initial locations of an object 113 and / or different target locations where the object 113 should be placed). Thus, the user demonstrates a human expert control policy. Accordingly, it is assumed that a set of N demonstrations $D_i = \{(o^t, a^t)\}_{t=1}^{T}$ has been collected, where i = 1, ..., N, T denotes the length of the trajectory, $o^t$ denotes an observation (e.g., agent states, image of the environment, etc.) at time t, and $a^t$ represents the corresponding action at time t (in reaction to the observation $o^t$, e.g. moving in a certain direction, accelerating in a certain direction, changing to a certain joint configuration, i.e. change joint angles, change end-effector pose, etc.). The training (by leveraging Riemannian Conditional Flow Matching (RCFM)) learns a parameterized policy $\pi_\theta$ with parameters $\theta$ that adheres to the human expert policy represented by the demonstrations.
[0053] In the following, the basic framework of Riemannian flow matching policies (RFMP) that leverage RCFM to achieve fast inference and easy training is first described. Second, Stable RFMP (SRFMP) is described as a variant of RFMP that leverages flows built on a Lyapunov-like formulation that ensures stability within the generation process.
[0054] RFMP adapts RCFM to visuomotor robot policies by learning an observation-conditioned vector field $v_t(a \mid o; \theta)$.
[0055] RFMP employs receding horizon control, predicting an action series with horizon $T_p$ instead of single-step actions to achieve temporal consistency and smoothness of the generated motion. This means that the predicted action horizon vector is constructed as $a = (a^t, \dots, a^{t+T_p})$, where a is a sample from the target distribution, $a^j$ is the action at time step j, and $T_p$ is the prediction horizon. It should be noted that the samples $a_0$ from the base distribution (which is mapped to the target distribution by the flow that is trained) are constructed as $a_0 = [a_{p_0}, \dots, a_{p_0}]$ with $a_{p_0} \sim p_0$. In inference, to allow the agent (e.g. robot device or its controller) to quickly react to environment changes, only the first $T_e$ steps of a predicted action series are executed according to RFMP, with $T_e < T_p$.
[0056] In contrast to the action horizon vector, the observation condition variable o is not defined on a receding horizon but is (for training) constructed by randomly sampling an observation vector (from observations included in training examples). Specifically, according to various embodiments, for training, RFMP uses a sampling strategy which uses:
[0057] (1) a reference observation $o^t$ at time step t,
[0058] (2) a context observation $o^c$ randomly sampled from an observation window with an observation horizon $T_o$, i.e., c is uniformly sampled from $\{t - T_o + 1, \dots, t - 1\}$, and
[0059] (3) the time gap t - c between the context observation and the reference observation.
[0060] Accordingly, the observation vector is defined as $o = (o^t, o^c, t - c)$. It should be noted that, when $T_o = 2$, the time gap can be disregarded, so that the observation o contains only the current observation and the observation of the preceding time step.
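The sampling strategy above can be sketched in a few lines. This is an illustrative sketch only: the function name `sample_observation` and the list-of-observations data layout are assumptions made for the example, not part of the disclosure.

```python
import random

def sample_observation(observations, t, T_o, rng=random):
    """Return (o_t, o_c, t - c) with c drawn uniformly from {t-T_o+1, ..., t-1}."""
    if T_o < 2 or t - T_o + 1 < 0:
        raise ValueError("need T_o >= 2 and at least T_o - 1 past observations")
    c = rng.randint(t - T_o + 1, t - 1)   # randint includes both end points
    return observations[t], observations[c], t - c

# Hypothetical recorded observations o^0 ... o^9 (placeholders for images/states).
obs = [f"o{k}" for k in range(10)]

# With T_o = 2 the window contains a single index, so c = t - 1 and the gap is 1.
o_t, o_c, gap = sample_observation(obs, t=5, T_o=2)

# With a longer window the context index and the gap vary between calls.
_, _, wide_gap = sample_observation(obs, t=5, T_o=4)
```

For `T_o = 2` the gap is always 1, which is why it carries no information and can be dropped from the observation vector in that case.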
[0061] Figure 2 illustrates RFMP.
[0062] RFMP leverages vision-based or state-based observations 201 from the environment of the robot device and / or of the robot device itself which are, in the example of figure 2, encoded by a Linet 202, to learn an observation-conditioned vector field 203, which is represented by a neural network with parameters (i.e. weights) $\theta$.
[0064] A sample $a_0$ from the base distribution is mapped, by integrating it along the vector field 203 (for a specific observation, which is input to the neural network representing the vector field 203), to a predicted action horizon vector $a_1$ (which is an element of the target distribution (of demonstrations), which can thus be seen to be modelled by the flow) and which gives robot control actions. In other words, the control policy is generated by integrating the learned vector field 203 with noisy samples as initial points.
[0065] In other words, the neural network 203 takes the encoded observation (provided by the Linet 202 from an unencoded observation) as condition to synthesise an observation-conditioned vector field $v_t(a \mid o; \theta)$, and the robot action sequence $a = (a^t, \dots, a^{t+T_p})$ is then generated by integrating the synthesized vector field, which essentially reshapes samples from the prior distribution $p_0$ into control actions of the target policy. The control actions 204 from the robot action sequence (e.g. the first $T_e$ ones) are then used to control the robot.
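[0066] The RFMP loss function is given by
$$\ell_{\mathrm{RFMP}}(\theta) = \mathbb{E}_{t,\, q(a_1),\, p_t(a \mid a_1)} \left\| v_t(a \mid o; \theta) - u_t(a \mid a_1) \right\|^2, \tag{1}$$
where the expectation is taken over the flow time t, the demonstrated action sequences $a_1$ and the intermediate points a on the path from $a_0$ to $a_1$.
[0067] The loss punishes that the observation-conditioned vector field $v_t(a \mid o; \theta)$, i.e. the vector field 203, deviates from a reference vector field $u_t$ (which is for example set to connect $a_0$ to $a_1$, as given by the demonstrations, with geodesics, i.e. is a geodesic flow). Illustratively speaking, the neural network learns to interpolate a discrete vector field given by the demonstrations.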
[0068] The machine learning model 112 for example corresponds to the Linet 202 and the vector field 203 (i.e. comprises both).
[0069] Algorithm 1 summarizes the training process of RFMP. It should be noted that it inherits most of the training framework of RFM, the main difference being that RFMP learns a conditioned vector field based on observations o while RCFM learns a fixed vector field.
[0071] Algorithm 1: Training Process of Riemannian Flow Matching Policy
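The listing of Algorithm 1 is not reproduced in this text. As a rough orientation only, the sketch below evaluates one Monte Carlo sample of a flow matching loss in the Euclidean special case, where geodesics are straight lines. It is not the disclosure's algorithm: the function name `rfmp_loss_sample`, the standard normal base distribution and the stand-in callable `v_theta` (in place of a trained, observation-conditioned network) are all assumptions made for the example.

```python
import numpy as np

def rfmp_loss_sample(v_theta, a1, o, action_dim, rng):
    """One sample of the squared error between v_theta and the reference field."""
    horizon = a1.shape[0]                       # T_p + 1 actions in the sequence
    a_p0 = rng.standard_normal(action_dim)      # one action from the base distribution (assumed Gaussian)
    a0 = np.tile(a_p0, (horizon, 1))            # repeated over the prediction horizon
    t = rng.uniform(0.0, 1.0)                   # flow time in [0, 1)
    a_t = (1.0 - t) * a0 + t * a1               # point on the straight line from a0 to a1
    u_t = a1 - a0                               # reference field along that line
    return float(np.sum((v_theta(a_t, o, t) - u_t) ** 2))

rng = np.random.default_rng(0)
a1 = np.linspace(0.0, 1.0, 8).reshape(4, 2)     # hypothetical demonstrated action sequence

# A field that already equals the conditional reference field gives (near) zero loss.
oracle = lambda a, o, t: (a1 - a) / (1.0 - t)
loss_oracle = rfmp_loss_sample(oracle, a1, o=None, action_dim=2, rng=rng)

# An untrained stand-in that always outputs zero is penalised.
zero_field = lambda a, o, t: np.zeros_like(a)
loss_zero = rfmp_loss_sample(zero_field, a1, o=None, action_dim=2, rng=rng)
```

In an actual training loop this quantity would be averaged over demonstrations and minimised with respect to the network parameters; on a manifold, the straight-line interpolation would be replaced by a geodesic.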
[0072] After training RFMP, i.e., after learning the observation-conditioned vector field $v_t(a \mid o; \theta)$, the inference process that essentially queries the policy boils down to
[0073] (1) Draw a sample $a_0$ from the prior distribution $p_0$
[0074] (2) Construct the observation vector o
[0076] (3) Utilize an off-the-shelf ODE solver to integrate the learned vector field $v_t(a \mid o; \theta)$ with $a_0$ as start point along the time interval [0, 1] and get the generated action sequence
[0077] (4) Execute the first $T_e$ actions of this sequence with $T_e < T_p$.
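The four inference steps above can be sketched as follows for the Euclidean case. This is a minimal illustration under stated assumptions: `infer_actions` is a hypothetical helper, a fixed-step explicit Euler scheme stands in for the off-the-shelf ODE solver, and the callable `oracle` replaces a trained network.

```python
import numpy as np

def infer_actions(v_theta, o, a_p0, T_p, T_e, n_steps=10):
    """Integrate a vector field over [0, 1] and return the first T_e actions."""
    a = np.tile(a_p0, (T_p + 1, 1))          # a_0: base sample repeated over the horizon
    dt = 1.0 / n_steps
    for k in range(n_steps):                 # explicit Euler steps from t = 0 to t = 1
        a = a + dt * v_theta(a, o, k * dt)
    return a[:T_e]                           # receding horizon: keep only T_e actions

# Hypothetical target sequence with T_p + 1 = 4 actions of dimension 2.
target = np.linspace(0.0, 1.0, 8).reshape(4, 2)

def oracle(a, o, t):
    """Stand-in for a trained field: points straight at the target sequence."""
    return (target - a) / (1.0 - t)

actions = infer_actions(oracle, o=None, a_p0=np.zeros(2), T_p=3, T_e=2)
```

Only the first `T_e` rows are returned for execution; the remaining predicted actions are discarded and inference is repeated with a fresh observation.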
[0078] As indicated in algorithm 1, when training, the time step t is sampled from a uniform distribution with boundary [0, 1]. This is as in CFM and RCFM. During inference, the learned vector field is also integrated within these boundaries. However, CFM and RCFM do not guarantee that the flow converges stably to the target distribution at t = 1. Besides, the associated vector field may even display strongly diverging behaviours when going slightly beyond this upper boundary. To address this issue, according to various embodiments, Stable Autonomous Flow Matching (SFM) is applied to RFMP (i.e. the RFMP is used with the SFM framework), which is generalized to the Riemannian case, to guarantee that the flow stabilizes to the target policy at t = 1.
[0079] In the following, Euclidean SFM is first summarized and then it is shown how SFM can be leveraged for policy generation within the RFMP. SFM leverages the stochastic La Salle's invariance principle (a method in control theory used to identify the stable states of a dynamical system) to design a stable vector field u.
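[0080] Stochastic La Salle’s Invariance Principle (Theorem 1): If there exists a time-independent vector field u, a flow $\psi$ generated by u, and a scalar function H such that they satisfy
$$\mathcal{L}_u H(x) = \nabla_x H(x)\, u(x) \le 0, \tag{2}$$
then
$$\lim_{t \to \infty} \psi(x_0, t) \in \{x \mid \mathcal{L}_u H(x) = 0\}$$
almost surely.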
[0081] Theorem 1 notably holds if u(x) is a gradient field of H, i.e., if $u(x) = -\nabla_x H(x)^\top$. In this case, the problem of finding a stable vector field boils down to defining an appropriate scalar function H.
[0082] To have a time-independent vector field as La Salle's invariance principle requires, the state space (including states x) may be augmented with an additional dimension $\tau$, named temperature or pseudo time, so that the states become $z = (x, \tau)$. The pair H and u is then defined so that it satisfies (2) by
$$H(z \mid z_1) = \tfrac{1}{2} (z - z_1)^\top A\, (z - z_1) \tag{3}$$
and
$$u(z \mid z_1) = -\nabla_z H(z \mid z_1)^\top = -A\,(z - z_1), \tag{4}$$
where A is a positive-definite matrix.
[0083] In other words, H is defined as a (quadratic) energy function and u as a vector field pointing in the direction of decreasing energy. Thus, illustratively speaking, when this vector field is followed (as the RFMP loss of equation (1) rewards it), a stable state (minimum energy) is reached.
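As a quick sanity check of this statement, the directional derivative of the energy along the field can be written out. This is a sketch under the assumptions that the energy is the quadratic form $H(z \mid z_1) = \tfrac{1}{2}(z - z_1)^\top A (z - z_1)$ with a symmetric positive-definite matrix A, and that u is its negative gradient:

```latex
\begin{aligned}
\nabla_z H(z \mid z_1) &= (z - z_1)^\top A, \\
u(z \mid z_1) &= -\nabla_z H(z \mid z_1)^\top = -A\,(z - z_1), \\
\mathcal{L}_u H(z) &= \nabla_z H(z \mid z_1)\, u(z \mid z_1)
  = -(z - z_1)^\top A^\top A\,(z - z_1)
  = -\lVert A\,(z - z_1) \rVert^2 \le 0 .
\end{aligned}
```

Under these assumptions equality holds only at $z = z_1$, because A is invertible, so the set in which the flow settles contains only the target point.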
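[0084] To simplify the calculation, the matrix A is set to
$$A = \begin{bmatrix} \lambda_x I & 0 \\ 0 & \lambda_\tau \end{bmatrix} \tag{5}$$
with $\lambda_x, \lambda_\tau \in \mathbb{R}$. The vector field u and the (stable) flow $\psi$ generated by it are then given as
$$u(z \mid z_1) = \begin{bmatrix} u_x(x \mid x_1) \\ u_\tau(\tau \mid \tau_1) \end{bmatrix} = \begin{bmatrix} -\lambda_x (x - x_1) \\ -\lambda_\tau (\tau - \tau_1) \end{bmatrix} \tag{6}$$
and
$$\psi_t(z_0 \mid z_1) = \begin{bmatrix} x_1 + e^{-\lambda_x t} (x_0 - x_1) \\ \tau_1 + e^{-\lambda_\tau t} (\tau_0 - \tau_1) \end{bmatrix}. \tag{7}$$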
[0085] The parameters $\lambda_x$ and $\lambda_\tau$ influence the convergence of the flow. Specifically, the flow converges faster with larger $\lambda_x$ and $\lambda_\tau$, and the ratio between $\lambda_x$ and $\lambda_\tau$ determines the relative rate of convergence of the spatial and pseudo-time parts of the flow.
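The effect of these parameters can be illustrated numerically. The sketch below assumes a linear field of the form $-\lambda (z - z_1)$ per component, whose exact flow is an exponential decay toward $z_1$; the names `stable_field` and `stable_flow` and all numeric values are hypothetical. It also shows why a single explicit Euler step can suffice for this reference field: with step size $1/\lambda$ the step lands exactly on the target. A trained network only approximates the reference field, so this holds only approximately in practice.

```python
import numpy as np

def stable_field(z, z1, lam_x, lam_tau):
    """Linear field -A (z - z1) with A = diag(lam_x, ..., lam_x, lam_tau)."""
    lam = np.append(np.full(z.size - 1, lam_x), lam_tau)
    return -lam * (z - z1)

def stable_flow(z0, z1, t, lam_x, lam_tau):
    """Exact solution of dz/dt = stable_field(z): exponential decay toward z1."""
    lam = np.append(np.full(z0.size - 1, lam_x), lam_tau)
    return z1 + np.exp(-lam * t) * (z0 - z1)

z0 = np.array([0.3, -1.2, 0.0])     # (x_0, tau_0 = 0)
z1 = np.array([1.0, 0.5, 1.0])      # (x_1, tau_1 = 1)
lam = 2.5

# Integrating far beyond t = 1 does not diverge: the state stays at the target.
far = stable_flow(z0, z1, t=20.0, lam_x=lam, lam_tau=lam)

# A larger lambda is closer to the target at the same time t.
slow = stable_flow(z0, z1, t=1.0, lam_x=1.0, lam_tau=1.0)
fast = stable_flow(z0, z1, t=1.0, lam_x=5.0, lam_tau=5.0)

# One explicit Euler step of size 1 / lambda reaches z1 for this linear field.
one_step = z0 + (1.0 / lam) * stable_field(z0, z1, lam, lam)
```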
[0086] According to various embodiments, SFM as described above is integrated into RFMP by regressing the observation-conditioned vector field $v(z \mid o; \theta)$ to a stable target vector field $u(z \mid z_1)$, where $z = (a^t, \dots, a^{t+T_p}, \tau)$ is the augmented prediction horizon vector, and o is the observation vector. This is done by using $v(z \mid o; \theta)$ and $u(z \mid z_1)$ in the loss of equation (1) instead of $v_t(a \mid o; \theta)$ and $u_t(a \mid a_1)$, respectively. Thus, it is achieved that the vector field 203 is stable (since it “inherits” stability from the vector field which is stable according to Stochastic La Salle’s Invariance Principle).
[0087] However, as mentioned above, the SFM framework is designed only for Euclidean spaces. Therefore, in the following, for stable Riemannian flow matching (SRFM) SFM is generalized to Riemannian manifolds.
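[0088] As in SFM, to obtain a time-invariant vector field, the state space for SRFM is defined as $z = (x, \tau)$, where $x \in \mathcal{M}$ and $\tau \in \mathbb{R}$. It should be noted that, in this case, z belongs to the product of Riemannian manifolds $\mathcal{M} \times \mathbb{R}$. For the Riemannian case (i.e. to cover the non-Euclidean case), H is formulated as
$$H(z \mid z_1) = \tfrac{1}{2}\, \mathrm{Log}_{z_1}(z)^\top A\, \mathrm{Log}_{z_1}(z), \tag{8}$$
wherein $\mathrm{Log}_{z_1}$ is the logarithmic map from the manifold to the tangent space (at the manifold point $z_1$), which leads to the Riemannian vector field
$$u(z \mid z_1) = -\nabla_z H(z \mid z_1)^\top = -A\, \mathrm{Log}_{z_1}(z). \tag{9}$$
[0089] Setting A as in equation (5) gives
$$u(z \mid z_1) = \begin{bmatrix} u_x(x \mid x_1) \\ u_\tau(\tau \mid \tau_1) \end{bmatrix} = \begin{bmatrix} -\lambda_x\, \mathrm{Log}_{x_1}(x) \\ -\lambda_\tau (\tau - \tau_1) \end{bmatrix}, \tag{10}$$
which generates the stable Riemannian flow
$$\psi_t(z_0 \mid z_1) = \begin{bmatrix} \mathrm{Exp}_{x_1}\!\left(e^{-\lambda_x t}\, \mathrm{Log}_{x_1}(x_0)\right) \\ \tau_1 + e^{-\lambda_\tau t} (\tau_0 - \tau_1) \end{bmatrix}, \tag{11}$$
where $\mathrm{Exp}_{x_1}$ is the exponential map, i.e. the inverse of $\mathrm{Log}_{x_1}$.
[0090] The influence of the parameters $\lambda_x$ and $\lambda_\tau$ is similar to that in the Euclidean case.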
[0091] It should be noted that the spatial part of the stable flow equation (11) follows the general formulation of the conditional vector field on Riemannian manifolds.
[0092] Specifically, the vector field $u_x(x \mid x_1)$ of equation (10) can be recovered by identifying d with the geodesic distance $d_g$ and setting the scheduler $\kappa(t) = e^{-\lambda_x t}$.
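To make the role of the scheduler concrete, the sketch below evaluates a geodesic contraction on the unit sphere. It assumes that the spatial part of the stable flow has the form $\mathrm{Exp}_{x_1}(e^{-\lambda_x t}\, \mathrm{Log}_{x_1}(x_0))$, so that the geodesic distance to the target shrinks by the factor $e^{-\lambda_x t}$; the helper names and the sphere maps are illustrative choices, not code from the disclosure.

```python
import numpy as np

def sphere_exp(x, v, eps=1e-12):
    """Exponential map of the unit sphere at x applied to tangent vector v."""
    n = np.linalg.norm(v)
    return x.copy() if n < eps else np.cos(n) * x + np.sin(n) * v / n

def sphere_log(x, y, eps=1e-12):
    """Logarithmic map of the unit sphere: tangent vector at x pointing to y."""
    c = np.clip(np.dot(x, y), -1.0, 1.0)
    w = y - c * x
    n = np.linalg.norm(w)
    return np.zeros_like(x) if n < eps else np.arccos(c) * w / n

def geodesic_distance(x, y):
    return float(np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)))

def stable_sphere_flow(x0, x1, t, lam_x):
    """Shrink the tangent vector at x1 that points to x0 by exp(-lam_x * t)."""
    return sphere_exp(x1, np.exp(-lam_x * t) * sphere_log(x1, x0))

x0 = np.array([1.0, 0.0, 0.0, 0.0])                  # start quaternion
x1 = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)   # target quaternion
lam_x = 2.0

start = stable_sphere_flow(x0, x1, t=0.0, lam_x=lam_x)   # equals x0
mid = stable_sphere_flow(x0, x1, t=1.0, lam_x=lam_x)     # partway to x1
late = stable_sphere_flow(x0, x1, t=10.0, lam_x=lam_x)   # practically x1

d0 = geodesic_distance(x0, x1)
d_mid = geodesic_distance(mid, x1)
```

Every point of the flow keeps unit norm, so the quaternion constraint is satisfied by construction rather than by renormalising after integration.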
[0093] By using $v(z \mid o; \theta)$ and $u(z \mid z_1)$ in the loss of equation (1), the stable Riemannian flow matching policy (SRFMP) is obtained by regressing (using this loss) the observation-conditioned Riemannian vector field $v(z \mid o; \theta)$ to the stable Riemannian vector field $u(z \mid z_1)$, where $z = (a^t, \dots, a^{t+T_p}, \tau)$ is the augmented Riemannian prediction horizon vector, and o is the observation vector. $a^t$ is the robot action at real world time t.
[0094] An observation is for example composed of an RGB image of the current scene (e.g. robot device environment such as robot workspace), e.g. with resolution 96x96, and the robot device's state information (e.g. joint positions in case of a robot arm, velocity in case of a vehicle, etc.).
[0095] As base distribution, a uniform distribution or a wrapped Gaussian distribution may for example be used.
[0096] In summary, according to various embodiments, a method as illustrated in figure 3 is provided.
[0097] Figure 3 shows a flow diagram 300 illustrating a method for training a control policy for a robot device according to an embodiment.
[0099] In 301 , demonstrations are provided (e.g. recorded), each demonstration indicating, for one or more observations (including information about states of the robot device and / or states of an environment of the robot device), a sequence of actions of the robot device indicating how the robot device should react to the one or more observations (i.e. each demonstration demonstrates a trajectory that should be followed in a control situation giving rise to the one or more observations).
[0100] In 302, a neural network is trained to imitate, for each demonstration of the demonstrations, a reference (e.g. conditional, because it depends on the respective demonstration, namely on $a_1$ in the examples above) vector field ($u_t$ in the examples above), wherein the reference vector field is the negative gradient of an energy function (see equations (8) and (9) in the examples above) and (wherein the energy function is defined such that) following the reference vector field maps points (of the space on which the vector field is defined) representing (random) action sequences (where e.g. a random action from a base distribution is repeated, see the definition of $a_0$ given above) to a point (of the space on which the vector field is defined) representing the action sequence indicated by the demonstration.
[0101] The training of the neural network can also be seen as a training of an RFMP to generate the demonstrations from samples drawn from a base distribution (i.e. to represent the demonstrations) using an RFMP loss wherein, according to various embodiments, SFM is applied.
[0102] The vector field represented by the neural network is a conditional vector field: it depends on an observation supplied to it (for each time step t). In training, these are the observations made during the demonstration (or included in the demonstration according to their definition given above as $D_i = \{(o^t, a^t)\}_{t=1}^{T}$). For example, the observations come from images. According to one embodiment, at each time step the image, coming e.g. from a camera external to the robot, also captures the movement of the robot in the environment.
[0103] Making the neural network represent an observation-dependent vector field is realized by supplying the neural network with an observation (for each time step).
[0104] The trained vector field represents the trained control policy since it gives, for an observation, an action sequence to be followed.
[0105] An energy function can be understood as a scalar function that maps the state of a system to a real number representing its energy. The energy function governs the behaviour of the robot device (e.g. autonomous system), or the target vector field, by forcing it to behave according to the characteristics of that function. An example for an energy function that is used is a quadratic form (of data points or data points projected to a tangent space).
[0106] The method of figure 3 can for example be used to train a neural network which may then be used to generate actions (from observations and random points of the base distribution) for controlling the motion of a technical system as a function of vision or other sensory modalities processed by a neural network but also controlling the behaviour of a virtual avatar using the generative process of RFMP as a function of diverse sensory modalities in a virtual environment.
[0107] Thus, the method of figure 3 may in particular be used to generate a control signal for controlling a technical system, in particular a robot device like e.g. a computer-controlled machine, like a robotic system, a vehicle, a domestic appliance, a power tool, a manufacturing machine, a personal assistant or an access control system.
[0108] For example, for application to a robotic arm as illustrated in figure 1, first, a set of demonstrations (training data) is collected by teleoperating the robotic arm. The demonstration dataset is composed of proprioceptive data and the state of the environment captured by a vision system. In one embodiment, the sensory data is processed by a ResNet (corresponding to the Linet 202 in figure 2) to obtain observation embeddings that are subsequently used to condition the RFMP vector field 203, which for example has a Linet architecture.
[0109] During inference, given an encoded observation (in particular encoded images), the vector field is queried to output a desired action sequence. Specifically, during inference, the vector field 203 is integrated (given an observation for each time step) by an off-the-shelf ODE solver, whose result is the action sequence.
[0110] The observations may include or be derived from various kinds of sensor data such as proprioceptive data and image data, e.g. video, radar, LiDAR, ultrasonic, motion, thermal images etc.
[0111] The method of Figure 3 may be performed by one or more data processing devices (e.g. computers or microcontrollers) having one or more data processing units. The term "data processing unit" may be understood to mean any type of entity that enables the processing of data or signals. For example, the data or signals may be handled according to at least one (i.e. , one or more than one) specific function performed by the data processing unit. A data processing unit may include or be formed from an analog circuit, a digital circuit, a logic circuit, a microprocessor, a microcontroller, a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or any combination thereof. Any other means for implementing the respective functions described in more detail herein may also be understood to include a data processing unit or logic circuitry. One or more of the method steps described in more detail herein may be performed (e.g., implemented) by a data processing unit through one or more specific functions performed by the data processing unit.
[0112] Accordingly, according to one embodiment, the method is computer-implemented.
Claims
1. A method for training a control policy for a robot device (101), comprising:
Providing (301) demonstrations, each demonstration indicating, for one or more observations (201), a sequence of actions of the robot device (101) indicating how the robot device (101) should react to the one or more observations (201); and
Training (302) a neural network (203) to imitate, for each demonstration of the demonstrations, a reference vector field, wherein the reference vector field is the negative gradient of an energy function and following the reference vector field maps points representing action sequences to a point representing the action sequence indicated by the demonstration.
2. The method of claim 1 , wherein each of the points representing the action sequences represents a respective action sequence and an initial value of a scalar and the point representing the action sequence indicated by the demonstration represents the action sequence indicated by the demonstration and a target value of the scalar.
3. The method of claim 1 or 2, wherein the energy function goes to a minimum when the demonstration is reached.
4. The method of any one of claims 1 to 3, wherein the reference vector field is a vector field on a Riemannian manifold and applying the energy function to a point of the Riemannian manifold comprises mapping the point to a tangent space of the Riemannian manifold.
5. The method of any one of claims 1 to 4, wherein the energy function is a quadratic form.
6. A method for controlling a robot device (101) in a control situation, comprising:
Training a control policy according to any one of claims 1 to 5;
Determining an observation (201) comprising information about the control situation;
Supplying the observation (201) and an initial point to the trained neural network (203);
Determining an action sequence by integrating the initial point along the vector field represented by the neural network (203) for the observation (201);
Controlling the robot device (101) by at least partially executing the determined action sequence.
7. A data processing system (106), configured to perform a method of any one of claims 1 to 6.
8. A computer program comprising instructions which, when executed by a computer, make the computer perform a method according to any one of claims 1 to 6.
9. A computer-readable medium comprising instructions which, when executed by a computer, make the computer perform a method according to any one of claims 1 to 6.