Data processing device, method, and program
The use of reinforcement learning with penalty updates in route searching addresses inefficiencies in combinatorial optimization by optimizing routes based on empirical knowledge, enhancing computational efficiency and adherence to movement conditions.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- ROBERT BOSCH GMBH
- Filing Date
- 2022-01-18
- Publication Date
- 2026-06-26
AI Technical Summary
Combinatorial optimization problems like the Traveling Salesman Problem face enormous computational complexity and time inefficiencies in route search, despite the presence of empirical knowledge that can guide the search.
A data processing device and method utilize reinforcement learning to search for routes by defining a value function, updating it with a penalty value when empirical conditions are not met, optimizing the route to minimize loss and adhere to given movement conditions.
This approach improves the efficiency of route searching by ensuring adherence to empirical knowledge, reducing computational time and cost, and optimizing routes based on empirical strategies.
Abstract
Description
Technical Field
[0001] The present invention relates to a data processing apparatus, method, and program.
Background Art
[0002] The Traveling Salesman Problem (TSP) asks in what order a salesman should visit multiple cities, each exactly once, so as to minimize the total travel cost. It is known as a classical optimization problem and is applied to searches for product delivery routes and the like (see, for example, Patent Document 1).
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] In combinatorial optimization problems such as the Traveling Salesman Problem, the number of combinations is commonly enormous. As a result, the computational complexity increases and route search tends to take a long time. On the other hand, empirical knowledge may already exist for a route search, for example that it is better to visit a certain city before another city. Such partial empirical knowledge is likely to serve as a guideline for route search.
[0005] The present invention aims to improve the efficiency of route search.
Means for Solving the Problems
[0006] One aspect of the present invention is a data processing device (1) that searches for a route visiting multiple locations using reinforcement learning. The data processing device (1) includes a calculation unit (121) that searches for the route based on a value function, and an update unit (122) that updates the value function so that the loss function of the value function is minimized. The value function outputs the estimated value of each location to be visited next, based on the visit status of each location. The calculation unit (121) searches for the route that moves to locations with large output values from the value function. If the output value from the value function does not satisfy the predetermined movement conditions, the update unit (122) updates the value function by adding a penalty value to the loss function.
[0007] Another aspect of the present invention is a method for searching for a route that visits a plurality of locations using reinforcement learning. The method includes the steps of searching for the route based on a value function and updating the value function so that the loss function of the value function is minimized. The value function outputs the estimated value of each location to be visited next, given the visit status of each location. The step of searching for the route includes searching for the route that moves to locations with large output values from the value function. The updating step includes updating the value function by adding a penalty value to the loss function if the output value from the value function does not satisfy a predetermined travel condition.
[0008] Another aspect of the present invention is a program for causing a computer to perform a method for finding a route that visits a number of locations. The method includes the steps of: finding the route based on a value function; and updating the value function so that a loss function of the value function is minimized. The value function outputs the estimated value of each location to be visited next, given the visit status of each location. The route-finding step includes finding the route that moves to locations where the output value from the value function is large. The updating step includes updating the value function by adding a penalty value to the loss function if the output value from the value function does not satisfy a predetermined travel condition.
Effects of the Invention
[0009] According to the present invention, it is possible to improve the efficiency of route searching.
Brief Explanation of the Drawings
[0010]
[Figure 1] A diagram showing the configuration of a data processing device.
[Figure 2] A flowchart of the route search process.
[Figure 3] A diagram showing an example of multiple given locations.
[Figure 4] A diagram showing an example of a neural network.
Modes for Carrying Out the Invention
[0011] Embodiments of the data processing device, method, and program of the present invention will be described below with reference to the drawings. The configuration described below is an example (representative example) of the present invention, and the present invention is not limited to this configuration.
[0012] Figure 1 shows the configuration of the data processing device 1 of this embodiment. The data processing device 1 comprises a control unit 11, a route search unit 12, and a storage unit 13. The data processing device 1 may further comprise an operation unit 14, a display unit 15, and a communication unit 16.
[0013] The control unit 11 controls each part of the data processing device 1. For example, the control unit 11 can cause the route search unit 12 to search for a route in response to an operation of the operation unit 14 or instruction data received by the communication unit 16. The control unit 11 can also display the route search result on the display unit 15.
[0014] The route search unit 12 searches for a route to visit a plurality of points by reinforcement learning. The route search unit 12 includes a calculation unit 121 and an update unit 122. The calculation unit 121 searches for a route based on a value function. The update unit 122 updates the value function so that the loss function of the value function is minimized.
[0015] In the present embodiment, the processing of the control unit 11 and the route search unit 12 is software processing realized by a computer such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or a microcomputer reading a program from the storage unit 13 and executing it. The processing may instead be realized by hardware such as an ASIC or an FPGA.
[0016] The storage unit 13 stores a program readable by the computer and tables used for program execution. The storage unit 13 also stores information on points and environments given in advance in route search, information on movement conditions, and the like. As the storage unit 13, for example, a recording medium such as a hard disk can be used.
[0017] The operation unit 14 is a keyboard or a mouse or the like. The operation unit 14 receives a user's operation and outputs the operation content to the control unit 11.
[0018] The display unit 15 is a display or the like. The display unit 15 displays an operation screen, a processing result of the control unit 11, a search result of the route search unit 12, etc. according to a display instruction from the control unit 11.
[0019] The communication unit 16 is an interface for communicating with an external computer via a network.
[0020] Figure 2 shows the flow of the route search process executed by the route search unit 12. Before the route search process, the data processing device 1 is given information such as an environment including a plurality of points to be visited and the movement conditions for each point. This information may be transmitted from an external computer via the communication unit 16 or input via the operation unit 14. The given information is stored in the storage unit 13, and the route search unit 12 can acquire the information from the storage unit 13.
[0021] Figure 3 shows an example of the given environment. The environment consists of 6×4 blocks and includes four points Ki (i = 0, 1, 2, 3). In the present embodiment, the route search unit 12 searches for a tour route in which the agent 30 starts moving from the start point Ps and visits each of the points K0 to K3 once, and the total movement cost of the agent 30 is minimized. The movement cost is, for example, the distance, time, or cost required for movement.
[0022] First, the calculation unit 121 stores the visit state s of the given n points Ki (i = 0, 1, ···, n-1) in the storage unit 13 (step S1). For each point, the visit state s holds 0 if the point is unvisited, 1 if it has been visited, and 2 if it is currently being visited. In the example of Figure 3, when point K0 is being visited, points K1 and K2 are unvisited, and point K3 has been visited, the visit state is s = {2, 0, 0, 1}.
[0023] Next, the calculation unit 121 defines a value function V (step S2). The value function V is a state value function that estimates the value of the next point to be visited for each visit state s of each point.
[0024] For example, the value function V is defined as shown in the following formula (1). As long as the value of each point to be visited next can be estimated for each visit state s of each point, the value function V is not limited to formula (1) and can be appropriately designed according to the target route search.
[Formula (1): not reproduced in this text]
[0025] s_t represents the visit state of each point at time step t, and s_{t+1} represents the visit state of each point at the next time step t+1. E[] represents the expected value of the expression within the brackets. γ represents the discount rate, satisfying 0 < γ ≤ 1. r represents the reward, given by the reward function. The reward function outputs a higher reward the smaller the total travel cost of the tour route. For example, a function that outputs the reciprocal of the total travel distance of the tour route (the total number of blocks traveled by agent 30) is given as the reward function along with the environment.
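The reciprocal-distance reward mentioned above can be sketched as follows. This is only an illustration: the function names, the representation of a route as a list of (x, y) block coordinates, and the use of Manhattan distance between consecutive waypoints are assumptions made here, not details given in the description.

```python
def total_travel_distance(route):
    """Number of blocks traveled, as Manhattan distance between waypoints."""
    return sum(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(route, route[1:]))


def reward(route):
    """Reciprocal of the total travel distance: shorter routes score higher."""
    distance = total_travel_distance(route)
    if distance == 0:
        raise ValueError("route must cover at least one block")
    return 1.0 / distance


# Two hypothetical routes over the same four waypoints, in different orders.
route_a = [(0, 0), (2, 0), (2, 3), (5, 3)]   # 2 + 3 + 3 = 8 blocks
route_b = [(0, 0), (5, 3), (2, 0), (2, 3)]   # 8 + 6 + 3 = 17 blocks

print(reward(route_a) > reward(route_b))     # True
```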
[0026] The calculation unit 121 approximates the value function V using a neural network. Figure 4 shows an example of a neural network 50 that approximates the value function V. The neural network 50 comprises an input layer 51, a hidden layer 52, an output layer 53, and a normalization layer 54. The input layer 51, the hidden layer 52, and the output layer 53 each have multiple nodes 55, and nodes 55 in adjacent layers are connected by edges 56. In Figure 4, there is one hidden layer 52, but there may be multiple hidden layers 52.
[0027] The visit status s of each location is input to each node 55 of the input layer 51 and output to each node 55 of the hidden layer 52, which is connected by edges 56. At each node 55 of the hidden layer 52, each input value is multiplied by a weight, a bias is added to the sum, and the result is output to each node 55 of the output layer 53. In the output layer 53, the output value Vi is calculated from the input values in the same way as in the hidden layer 52.
[0028] In the normalization layer 54, each output value Vi is normalized by a normalization function such as the softmax function or the ReLU function, and a normalized output value Vni is output. With the softmax function, the output values Vni sum to 1; for example, the value Vn0 of location K0 may be 0.9 and the value Vn1 of location K1 may be 0.08.
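A minimal forward pass through a network of this shape might look like the sketch below. It assumes a single hidden layer of eight nodes, random initial weights, and no activation function in the hidden layer (the description mentions only weights and biases); the names `dense`, `forward`, and `init_params` are hypothetical.

```python
import math
import random


def dense(inputs, weights, biases):
    """Weighted sum of the inputs plus a bias, for each node of a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(values):
    """Normalize raw outputs Vi into values Vni that sum to 1."""
    shift = max(values)                      # subtract the max for stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(n_points, n_hidden, seed=0):
    """Random weights and zero biases (an arbitrary choice for this sketch)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    return {"w1": matrix(n_hidden, n_points), "b1": [0.0] * n_hidden,
            "w2": matrix(n_points, n_hidden), "b2": [0.0] * n_points}


def forward(state, params):
    """Return (raw outputs Vi, normalized outputs Vni) for a visit state s."""
    hidden = dense(state, params["w1"], params["b1"])
    raw = dense(hidden, params["w2"], params["b2"])
    return raw, softmax(raw)


params = init_params(n_points=4, n_hidden=8)
raw, normalized = forward([2, 0, 0, 1], params)
```

Returning both the raw and the normalized outputs keeps the pre-normalization values Vi available for later checks.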
[0029] Next, the calculation unit 121 searches for a tour route based on the value function V (step S3). Specifically, the calculation unit 121 selects an action for agent 30 according to a certain policy π(s). For example, the calculation unit 121 selects an action to move to one of the blocks to the left, right, up, or down. By repeatedly selecting an action, agent 30 moves one block at a time.
[0030] At the start of the search, agent 30's current position is the starting point Ps, and its visit status s is s = {0,0,0,0}. The calculation unit 121 updates the visit status s each time agent 30 takes an action, but the visit status s changes only when agent 30 reaches a certain point Ki. For example, in Figure 3, when agent 30 reaches point K0 from the starting point Ps, s changes to {2,0,0,0}. When it leaves point K0, s changes to {1,0,0,0}.
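The visit-state bookkeeping described above can be sketched in a few lines. The constant and function names are hypothetical; only the 0/1/2 encoding comes from the description.

```python
UNVISITED, VISITED, VISITING = 0, 1, 2


def initial_state(n_points):
    """All points start unvisited."""
    return [UNVISITED] * n_points


def arrive(state, i):
    """Return a new state in which point i is currently being visited."""
    new_state = list(state)
    new_state[i] = VISITING
    return new_state


def leave(state, i):
    """Return a new state in which point i is marked as visited."""
    new_state = list(state)
    new_state[i] = VISITED
    return new_state


s0 = initial_state(4)   # [0, 0, 0, 0] at the starting point Ps
s1 = arrive(s0, 0)      # [2, 0, 0, 0] when agent 30 reaches K0
s2 = leave(s1, 0)       # [1, 0, 0, 0] when agent 30 leaves K0
```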
[0031] When agent 30 has finished visiting all locations Ki and the visit state has become s = {1,1,1,1}, the calculation unit 121 terminates the search. The series of actions from the starting point Ps to the last location is called an episode. The trajectory along which agent 30 moved during that time is the tour route explored in one episode.
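One episode can be sketched at the level of points rather than individual blocks. This is a simplification made for illustration: the description moves agent 30 one block at a time, whereas the sketch jumps directly to the unvisited point with the largest value. Restricting the choice to unvisited points, and the fixed `toy_value_fn`, are assumptions.

```python
def run_episode(value_fn, n_points):
    """Visit every point once, always moving to the highest-valued unvisited point."""
    state = [0] * n_points      # 0 = unvisited, 1 = visited, 2 = being visited
    order = []
    while 0 in state:
        values = value_fn(state)
        candidates = [i for i in range(n_points) if state[i] == 0]
        nxt = max(candidates, key=lambda i: values[i])
        state = [1 if flag == 2 else flag for flag in state]  # leave current point
        state[nxt] = 2                                        # arrive at next point
        order.append(nxt)
    state = [1 if flag == 2 else flag for flag in state]      # leave the last point
    return order, state


def toy_value_fn(state):
    # Hypothetical fixed preferences standing in for a trained value function.
    return [0.15, 0.6, 0.2, 0.05]


order, final_state = run_episode(toy_value_fn, 4)
print(order)         # [1, 2, 0, 3]
print(final_state)   # [1, 1, 1, 1]
```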
[0032] Once the search is complete, the update unit 122 calculates the loss function L of the value function V, that is, the loss function L of the neural network that approximates the value function V (step S4). Equation (2) below shows an example of the loss function L of the value function V. The loss function L is not limited to this and can be designed as appropriate.
[Equation (2): not reproduced in this text]
[0033] θ represents the parameters of the neural network 50 that approximates the value function V. The parameters are, for example, the weights or biases used in calculations at each node 55. max represents the number of visit states s. t(i) represents the output of the max function, which returns the maximum value Vni from the value function V for the i-th visit state.
[0034] Next, the update unit 122 determines whether the output value Vi from the value function V satisfies the pre-given movement conditions (step S5). If the movement conditions are satisfied (step S5: YES), the update unit 122 updates the value function V so that the loss function L becomes smaller (step S7). The update unit 122 updates the policy π with the updated value function V.
[0035] The update optimizes policy π so that the transition is to the point Ki with the highest output value Vni from the value function V in each visit state s. By selecting actions according to the optimized policy π, a route is found that visits each point Ki sequentially once and has a small total travel cost.
[0036] If the movement conditions are not met (step S5: NO), the update unit 122 adds the penalty value A1 to the loss function L (step S6), and then updates the value function V so that the loss function L becomes smaller (step S7). Equation (2a) below shows an example of the loss function L to which the penalty value A1 has been added.
[Equation (2a): not reproduced in this text]
[0037] In this embodiment, the movement conditions are conditions relating to the order in which locations are visited, and this order is given in advance as empirical knowledge. For example, if it is empirically known that, from location i=m, it is more efficient to visit location i=k before location i=j, then the condition Vk>Vj is given as the order condition.
[0038] For the visit state s in which location i=m has been visited, the update unit 122 obtains the output value Vi of the neural network 50 after it is output from the output layer 53 and before it is normalized in the normalization layer 54. If the obtained output value Vi does not satisfy the order condition Vk>Vj, the update unit 122 adds a penalty value A1 to the loss function L.
[0039] For example, the update unit 122 can add a penalty value A1, as shown in equation (3a) below, to the loss function L.
[Equation (3a): not reproduced in this text]
[0040] Alternatively, a penalty value A1 as shown in formula (3b) below may be used.
[Formula (3b): not reproduced in this text]
[0041] According to equation (3a) above, if the condition Vk > Vj is not met, the penalty value A1 will be greater than 0. Also, according to equation (3b) above, if the condition Vk > Vj is met, the penalty value A1 will be 0, and if the condition Vk > Vj is not met, the penalty value A1 will be greater than 0.
[0042] Therefore, the addition of the penalty value A1 increases the error in the loss function L. Since the parameters θ of the neural network 50 are updated in a direction that reduces the error, the value function V is optimized so that the condition Vk > Vj is satisfied.
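The effect of such a penalty can be illustrated with a small sketch. The hinge form max(0, Vj - Vk) used here is an assumed stand-in that is zero when Vk > Vj holds and positive when Vj exceeds Vk; it is not claimed to be the formula of the publication. For brevity the update adjusts the two output values directly, whereas in the described device the parameters θ of the neural network 50 would be updated.

```python
def penalty_hinge(v_k, v_j):
    """Assumed penalty: 0 when Vk > Vj, otherwise the size of the violation.

    Note that it is also 0 in the boundary case Vk == Vj.
    """
    return max(0.0, v_j - v_k)


def loss_with_penalty(base_loss, raw_values, k, j, weight=1.0):
    """Add the order penalty to a base loss that is computed elsewhere."""
    return base_loss + weight * penalty_hinge(raw_values[k], raw_values[j])


# Hypothetical raw outputs: Vk = values[0], Vj = values[1]; Vk > Vj is violated.
values = [0.2, 0.9, 0.4]
k, j, learning_rate = 0, 1, 0.1

for _ in range(20):
    if penalty_hinge(values[k], values[j]) > 0:
        # Subgradient of max(0, Vj - Vk) is -1 for Vk and +1 for Vj,
        # so a descent step raises Vk and lowers Vj.
        values[k] += learning_rate
        values[j] -= learning_rate

print(values[k] > values[j])   # True
```

Once the condition holds, the penalty term contributes nothing and only the base loss would drive further updates.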
[0043] The update unit 122 may determine whether the normalized output value Vni, rather than the output value Vi before normalization, satisfies the above order condition. However, it is preferable to make the determination using the output value Vi before normalization, as this provides higher determination accuracy.
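One possible reason for the difference in accuracy, offered here as an assumption rather than as the stated rationale, is that normalization can erase small differences in floating-point arithmetic. In the sketch below, with deliberately extreme hypothetical outputs, two raw values that differ become indistinguishable after softmax.

```python
import math


def softmax(values):
    """Normalize raw outputs Vi into values Vni that sum to 1."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def order_condition_met(values, k, j):
    """True when the value of point k exceeds that of point j."""
    return values[k] > values[j]


raw = [800.0, 0.0, 1.0]        # hypothetical raw outputs Vi
normalized = softmax(raw)      # the two small entries underflow to 0.0

print(order_condition_met(raw, k=2, j=1))          # True
print(order_condition_met(normalized, k=2, j=1))   # False
```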
[0044] When the value function V is updated, the calculation unit 121 updates the policy π using the updated value function V (step S8). Equation (4) below shows the update formula for policy π, but the update formula for policy π is not limited to this and can be appropriately designed to suit the desired route search.
[Equation (4): not reproduced in this text]
[0045] As the value function V is updated, locations that satisfy the movement conditions will have a larger output value Vi from the value function V than locations that do not satisfy the movement conditions. Therefore, the action of moving to a location that satisfies the movement conditions will be more likely to be selected by policy π.
[0046] If a predetermined number of episodes have not been completed (Step S9: NO), the process returns to Step S3 and repeats for the next episode. By repeating the episodes many times and iteratively updating the value function V and policy π, the explored route is optimized to a route that visits each point Ki once while satisfying the given movement conditions and has a small total movement cost. When a predetermined number of episodes have been completed (Step S9: YES), this process ends.
[0047] Furthermore, the timing for updating policy π is not limited to every episode, as described above, as long as it is at an appropriate time. For example, the timing for updating policy π may be every multiple episodes, or every time the visit state s changes. Also, the method for moving agent 30 can be any known method, such as moving in a way that maximizes the value function V, moving probabilistically according to the output value of the value function V, or moving randomly from time to time using the ε-greedy method.
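The selection strategies mentioned above can be sketched as follows. The function names and the `candidates` argument (the indices of points still eligible to be visited) are hypothetical; the proportional variant assumes non-negative values with a positive sum, such as softmax outputs.

```python
import random


def select_epsilon_greedy(values, candidates, epsilon=0.1, rng=random):
    """With probability epsilon pick a random candidate, else the best one."""
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=lambda i: values[i])


def select_proportional(values, candidates, rng=random):
    """Pick a candidate with probability proportional to its value."""
    weights = [values[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


values = [0.1, 0.6, 0.25, 0.05]       # hypothetical normalized outputs Vni
candidates = [0, 2, 3]                # point 1 is assumed already visited

greedy_choice = select_epsilon_greedy(values, candidates, epsilon=0.0)
explore_choice = select_epsilon_greedy(values, candidates, epsilon=1.0,
                                       rng=random.Random(0))
sampled_choice = select_proportional(values, candidates, rng=random.Random(0))
```

Setting `epsilon=0.0` reduces the first function to purely greedy movement, while `epsilon=1.0` makes every move random.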
[0048] As described above, according to this embodiment, a value function V is defined by the calculation unit 121 to estimate the value of the next point Ki to be visited, given the visit status s of each point. The calculation unit 121 searches for a route to move to a point Ki where the output value Vni from the value function V is large.
[0049] The update unit 122 updates the value function V so as to minimize its loss function L. However, if the output value Vi from the value function V does not satisfy the movement conditions given in advance as empirical knowledge, the update unit adds a penalty value A1 to the loss function L and updates the value function V.
[0050] As a result, the output value Vi for locations that do not meet the movement conditions becomes smaller than that for locations that do. Moving to a location that meets the movement conditions is therefore more likely to be selected as the next action, which makes it possible to search for a tour route that has a low movement cost and reflects empirical knowledge. Because the route is searched using partial empirical knowledge as a movement strategy, the efficiency of the search can be improved. Empirical knowledge can also be reflected in the policy π through the simple process of adding a penalty value A1.
[0051] The route search method of this embodiment can be suitably used in the field of logistics, such as for routes to pick goods from storage shelves in a warehouse or routes to deliver goods. Beyond logistics, it can be used to search for various routes, such as routes to transport materials at construction sites or routes to transport soil at embankment construction sites.
[0052] Furthermore, the above route search can be applied to optimizing work procedures by replacing each point with a task, determining which tasks to perform and in what order. For example, cooking involves tasks such as cutting ingredients, adding seasonings, letting it sit, boiling, and washing cooking utensils, and this method can search for the order in which these tasks can be performed to complete the dish in the shortest possible time. It can also be used to search for efficient procedures in product assembly, such as gluing part B to part A or fitting part C with part A.
[0053] Although preferred embodiments of the present invention have been described above, the present invention is not limited to these embodiments.
[0054] (Modification 1) The movement conditions may also be conditions relating to the position of each point. For example, heading for a distant point when there is a point close to the current point results in a detour. Based on this empirical knowledge, if point i=m is close to point i=j and far from point i=k, the condition Vj>Vk is given. If the output value Vi of the value function V for the visit state s in which point i=m has been visited does not satisfy this condition, the update unit 122 may add a penalty value A1 to the loss function L.
[0055] (Modification 2) Movement conditions are not limited to priority conditions such as the order or position described above, but may also be constraint conditions. Examples of constraint conditions include spatial constraints during movement, such as whether there are obstacles between each point that hinder the progress of agent 30, or whether the route is one-way. If these constraint conditions are not met, the update unit 122 may add a penalty value A2 to the loss function L.
[0056] For example, if there is an area that cannot be entered from point Kj (i=j), and point Kk (i=k) is located in that area, the condition maxVi≠Vk is given. If the output value Vk of point Kk is the maximum among the output values Vi from the value function V for a visit state s in which point i=j has been visited, i.e., maxVi=Vk, then the given movement condition is not met, and a penalty value A2 is added to the loss function L. The penalty value A2 in this case can be, for example, A2=|Vk| or A2=|Vk|^2.
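A constraint penalty of this kind can be sketched as below. The function name and arguments are hypothetical; the sketch returns |Vk| or |Vk|^2 when the unreachable point Kk has the largest output value and 0 otherwise.

```python
def constraint_penalty(raw_values, forbidden, squared=False):
    """Return A2 > 0 when the forbidden point has the largest output, else 0."""
    top = max(range(len(raw_values)), key=lambda i: raw_values[i])
    if top != forbidden:
        return 0.0                       # condition max Vi != Vk is satisfied
    magnitude = abs(raw_values[forbidden])
    return magnitude ** 2 if squared else magnitude


raw_values = [0.1, 0.3, 1.5, 0.2]        # hypothetical raw outputs Vi

print(constraint_penalty(raw_values, forbidden=2))                 # 1.5
print(constraint_penalty(raw_values, forbidden=2, squared=True))   # 2.25
print(constraint_penalty(raw_values, forbidden=1))                 # 0.0
```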
[0057] The addition of penalty value A2 increases the error in the loss function L. The value function V is updated to minimize this error, that is, so that the output value Vk at point Kk is smaller than that at other points, making it less likely that the action of moving from point Kj towards point Kk will be selected. Note that the penalty value A2 when the constraint condition is not met may be added instead of the penalty value A1 when the priority condition is not met, or both may be added.
Explanation of Symbols
[0058] 1...Data processing device, 12...Route search unit, 121...Calculation unit, 122...Update unit, 13...Storage unit
Claims
1. A data processing device (1) that searches, by reinforcement learning, for a route that visits multiple locations, comprising: a calculation unit (121) that searches for the route based on a value function; and an update unit (122) that updates the value function so that a loss function of the value function is minimized, wherein the value function outputs an estimated value of the next location to be visited, based on the visit status of each location, the calculation unit (121) searches for the route that moves to a location where the output value from the value function is large, the update unit (122) updates the value function by adding a penalty value to the loss function if the output value from the value function does not satisfy a predetermined movement condition, and the movement condition includes a priority condition regarding the order or position of locations to be visited.
2. The data processing device (1) according to claim 1, wherein the value function is approximated by a neural network (50), and the update unit (122) determines whether the output value output from the output layer (53) of the neural network (50), before normalization, satisfies the movement condition.
3. The data processing device (1) according to claim 1 or 2, wherein the movement condition includes a spatial constraint during movement.
4. A method in which a computer searches, by reinforcement learning, for a route that visits multiple locations, the method comprising: a step of searching for the route based on a value function; and a step of updating the value function so that a loss function of the value function is minimized, wherein the value function outputs an estimated value of the next location to be visited, based on the visit status of each location, the step of searching for the route includes searching for the route that moves to a location where the output value from the value function is large, the updating step includes updating the value function by adding a penalty value to the loss function if the output value from the value function does not satisfy a predetermined movement condition, and the movement condition includes a priority condition regarding the order or position of locations to be visited.
5. A program that causes a computer to perform a method for finding a route that visits multiple locations, the method comprising: a step of searching for the route based on a value function; and a step of updating the value function so that a loss function of the value function is minimized, wherein the value function outputs an estimated value of the next location to be visited, based on the visit status of each location, the step of searching for the route includes searching for the route that moves to a location where the output value from the value function is large, the updating step includes updating the value function by adding a penalty value to the loss function if the output value from the value function does not satisfy a predetermined movement condition, and the movement condition includes a priority condition regarding the order or position of locations to be visited.